Game Theory & Q-Learning
RL – Objectives
• State (s) – the situation the agent is in right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) – an action the agent can take while in a state.
Examples:
1. A knight (or pawn) captures a bishop
2. The user buys a ticket
• Reward (r) – the reward obtained as a result of the action.
Examples:
1. A better or worse position
2. More money or more clicks
Basic Concepts
• Policy (π) – the “strategy” by which the agent decides which action to take. Abstractly speaking, the policy is simply a probability distribution over actions, defined for each state.
• Episode – a sequence of states and their actions.
The Objective:
Predicting the expected future reward given the current state (s):
1. Which actions should we take in order to maximize our gain?
2. Which actions should we take in order to maximize the click rate?
Value function (V_π(s)) – the expected reward of the episode, given a policy π and a state s
=> Bellman Equation
Bellman Equation
Optimal Function
We wish to find the optimal policy:
V*(s) = max_π V_π(s)
Our aim:
1. Finding the optimal policy
2. Finding the value of this policy for each state.
Dynamic Programming
V(s_t) = E[r_{t+1} + γ·V(s_{t+1}) | s_t = s]
• Learning is based on a perfect model of the environment (assuming a Markov decision process).
• We update V_π(s_t) using the values of V_π(s_{t+1}), where both values are estimates: bootstrapping (a minimal policy-evaluation sketch follows this list).
• Navigation mostly follows ε-greedy methods.
• Policy improvement
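To make the bootstrapping idea concrete, here is a minimal sketch of iterative policy evaluation with a known model. The interface (the P, R and policy dictionaries) is assumed for illustration only; it is not from the original slides.

```python
def policy_evaluation(states, actions, P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a perfect model (hypothetical interface).

    P[s][a] -> list of (prob, next_state) pairs, R[s][a] -> expected reward,
    policy[s][a] -> probability of taking action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: V(s) = E[r + gamma * V(s')] under the policy.
            v = sum(policy[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```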
Estimating 𝑅𝑡 using Monte-Carlo
We estimate the value function by averaging the obtained returns. These methods are used in episodic tasks.
We need to complete an episode before updating the value function, so learning is not “step by step” but episode by episode.
Monte Carlo (Cont)
• Algorithm:
1. Choose a policy.
2. Run an episode under the policy.
In each step collect the reward r_t and add the resulting return to the list Returns(s_t).
3. V(s) = average(Returns(s)) (the return is denoted R_t)
The basic (incremental) formula, sketched in code below:
V(s_t) = V(s_t) + α[R_t − V(s_t)]
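A minimal every-visit Monte Carlo sketch of the update above. The `run_episode` callback is an assumed interface: it should return the (state, reward) pairs of one complete episode generated by the policy being evaluated.

```python
from collections import defaultdict

def mc_value_estimate(run_episode, num_episodes=1000, gamma=1.0, alpha=0.1):
    """Every-visit Monte Carlo with the incremental update V(s) += alpha * (R_t - V(s))."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = run_episode()          # list of (state, reward) pairs
        G = 0.0
        # Walk the episode backwards so G accumulates the return R_t of each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])
    return V
```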
Summary
1. Dynamic programming – bootstrapping: estimators are learned from other estimators. It requires assumptions about the model of the environment.
2. Monte Carlo – learning from experience, episode by episode. The update target is R_t.
TD Methods
• The motivation is to combine sampling with bootstrapping (DP with MC). The simplest formula is TD(0):
V(s_t) = V(s_t) + α[r_t + γ·V(s_{t+1}) − V(s_t)]
We do both bootstrapping (using the estimated values of V) and sampling (using r_t); see the sketch below.
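A sketch of the TD(0) update; the `env_reset` / `env_step` / `policy` callbacks are assumed names, not part of the slides.

```python
from collections import defaultdict

def td0_value_estimate(env_reset, env_step, policy, num_episodes=1000, gamma=0.9, alpha=0.1):
    """TD(0): sample r_t from the environment, bootstrap from the estimate V(s_{t+1}).

    env_reset() -> initial state, env_step(s, a) -> (next_state, reward, done),
    policy(s) -> action.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            s_next, r, done = env_step(s, policy(s))
            target = r + (0.0 if done else gamma * V[s_next])   # bootstrapping
            V[s] += alpha * (target - V[s])                     # sampling via r_t
            s = s_next
    return V
```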
Advantages of TD over DP
• No model of the environment (reward probabilities) is needed.
Advantages of TD over MC
• They can work online.
• They are independent of the length of the episode.
• They don’t need to ignore or discount episodes containing exploratory actions, since they learn from every visit.
• Convergence is guaranteed.
The Q Function
• We discussed value functions V which are defined on
the episode states
• In some applications we wish to learn these functions
on the state-action space (s, a).
• The analysis for DP & MC is similar (you can check)
SARSA
• We learn each state from the sequence
S -> A -> R -> S -> A:
Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ·Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
We use the action taken at time t+1 to learn the action at time t (see the sketch below).
• We learn Q_π online and use the visits it produces for the updates.
The policy moves toward greediness.
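A sketch of on-policy SARSA with an ε-greedy behaviour policy; the environment interface is the same assumed one as in the TD(0) sketch.

```python
import random
from collections import defaultdict

def sarsa(env_reset, env_step, actions, num_episodes=5000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """On-policy SARSA: the update uses the action actually taken at time t+1."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env_reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```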
Off-Policy: Q-Learning
• The basic formula:
Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
• The idea is that we update the previous state not from the action the policy actually takes (the update does not follow the policy) but from the best available action (see the sketch below).
• Action selection works in the same way as before, namely it follows ε-greedy.
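The off-policy counterpart: a Q-learning sketch in which behaviour is still ε-greedy but the update bootstraps from max_a Q(s_{t+1}, a). Again, the environment interface is an assumption for illustration.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, num_episodes=5000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Off-policy Q-learning: act epsilon-greedily, update towards the greedy value."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env_step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```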
Game Theory
• What is a game?
1. Two or more players
2. Strategies
3. Rewards
Game theory Examples:
• Zero-Sum
Dominant Moves
• Both players will play 4!
Nash Equilibrium
• If f_i are the payoff functions and x_1, x_2, …, x_n are strategies, then x* is an equilibrium if for every player i and every strategy x_i:
f_i(x_i*, x_{-i}*) ≥ f_i(x_i, x_{-i}*)
Existence Theorem
Every finite game (a finite number of players, each with a finite set of strategies) has a Nash equilibrium in mixed strategies.
Prisoner’s Dilemma
• The equilibrium is (c,c): both players confess, even though both would do better by cooperating (the payoff matrix from the slide is not reproduced here).
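To make the definition concrete, here is a small sketch that checks whether a pure-strategy profile is a Nash equilibrium, applied to a standard Prisoner’s Dilemma payoff matrix. The payoff values below are conventional textbook numbers, not taken from the slide.

```python
def is_pure_nash(payoff, strategies, profile):
    """payoff(i, profile) -> player i's payoff for a full strategy profile (a tuple);
    strategies[i] -> the list of player i's strategies.
    `profile` is an equilibrium if no player gains by deviating unilaterally."""
    for i in range(len(profile)):
        current = payoff(i, profile)
        for alt in strategies[i]:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if payoff(i, deviated) > current:
                return False
    return True

# Prisoner's Dilemma with 0 = cooperate (stay silent), 1 = confess/defect.
pd = {(0, 0): (-1, -1), (0, 1): (-3, 0), (1, 0): (0, -3), (1, 1): (-2, -2)}
payoff = lambda i, prof: pd[prof][i]
print(is_pure_nash(payoff, [[0, 1], [0, 1]], (1, 1)))   # True: (c,c) is the equilibrium
print(is_pure_nash(payoff, [[0, 1], [0, 1]], (0, 0)))   # False: mutual cooperation is not
```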
Battle of the Sexes
• The pure-strategy equilibria are (s,s) and (b,b); a mismatched profile such as (s,b) is not an equilibrium (the payoff matrix from the slide is not reproduced here).
Impartial Games
• Games in which, in each state, both players can take the same actions and get the same rewards.
Chess, Go?
No! In these games each player has their own pieces.
What is the difference between the players in an impartial game?
• Nothing! Simply, one player moves first and the other does not!
• Both players have complete information
Impartial Games (Cont.)
Basic Concepts
Normal Games - The last player to play wins
Misère Games - The last player to play loses
NP- Positions
N-position - The set of states in which the first player wins
P-position - The set of states in which the second player wins
A word on terminology: We assume that each state is a starting state, thus the first
player is the Next player and the second player is the Previous player
Sprague Grundy Function
• Let G be the graph of the game.
1. Every follower of a P-position is an N-position.
2. Every N-position has at least one follower that is a P-position.
The Sum of Games
If G1 = (V1, E1) and G2 = (V2, E2) are games, then their sum is G = (V1 × V2, E), where for v1, w1 ∊ V1 and v2, w2 ∊ V2:
E = {(v1v2, w1v2) | (v1, w1) ∊ E1} ∪ {(v1v2, v1w2) | (v2, w2) ∊ E2}
(a move in the sum is a move in exactly one of the component games)
Sprague Grundy Function
What can we say about the sum?
P + P -> P
N + P -> N
N + N -> ?
We need something better in order to evaluate the game!
The mex Function
• Let N be the non-negative integers and S ⊊ N. Then mex(S) = min(N \ S), the smallest non-negative integer not in S.
• For a graph G = (V, E) we define a function
g: V -> N
g(v) = mex{g(w) | (v, w) ∊ E}
• g(v) = 0 ⇔ v is a P-position
• For sub-games G1, G2 with sum G and v = (v1, v2) we have
g(v) = g(v1) ⊕ g(v2)
(see the sketch below)
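A minimal sketch of the mex and Sprague-Grundy computations; the `moves(position)` callback that enumerates a position’s followers is an assumed interface.

```python
def mex(values):
    """Smallest non-negative integer not contained in `values`."""
    s = set(values)
    m = 0
    while m in s:
        m += 1
    return m

def grundy(position, moves, cache=None):
    """Sprague-Grundy value g(position); `moves(position)` yields its followers."""
    if cache is None:
        cache = {}
    if position not in cache:
        followers = list(moves(position))
        # A terminal position has no followers, so mex of the empty set is 0 (a P-position).
        cache[position] = mex(grundy(w, moves, cache) for w in followers)
    return cache[position]

# A single Nim heap of size n can move to any size 0..n-1, so g(n) = n;
# the value of a sum of heaps is the XOR of their Grundy values.
print([grundy(n, lambda k: range(k)) for n in range(6)])   # [0, 1, 2, 3, 4, 5]
```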
Impartial Games (Cont.)
Sprague Grundy Theorem
Every impartial game under the normal play convention is equivalent to Nim!
(Hence Nim is not only a game but a whole class of games.)
What is Nim?
• A two-player game, originally from China.
• It has been played in Europe since the 16th century.
• In 1901 Bouton gave the game its name and a nice mathematical study.
How do we play?
• Two players
• N heaps of items, where the number of items in heap i is k_i
• On each turn a player removes some items:
1. They must remove at least one item!
2. They may remove items from a single heap only!
• Winning strategy – simply XOR the heap sizes (proved by induction; see the sketch below).
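A sketch of the XOR strategy: find a move that makes the XOR of the heap sizes zero, i.e. hands the opponent a P-position. The function name and interface are ours, for illustration.

```python
def nim_winning_move(heaps):
    """Return (heap_index, new_size) that zeroes the XOR of all heap sizes,
    or None when the position already has XOR 0 (a P-position: every move
    loses against correct play)."""
    x = 0
    for h in heaps:
        x ^= h
    if x == 0:
        return None
    for i, h in enumerate(heaps):
        if (h ^ x) < h:          # reducing heap i to h ^ x makes the total XOR zero
            return i, h ^ x

print(nim_winning_move([3, 4, 5]))   # (0, 1): move to (1, 4, 5), as in the example below
```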
Examples
• A single heap with M items.
Who is going to win?
• Two heaps, one with K items and one with 1...
• Two heaps, one with M items and one with K (M > K).
The Three Heaps Example
• We have three heaps with 3, 4 and 5 items
Let’s Play
(1,4,5)
(1,2,5)
(1,2,3)
(1,2,2)
(0,2,2)
…
Nim & QLearning
• There is a natural similarity between games and reinforcement learning:
1. A collection of states S
2. For each s ∊ S there are potential actions that a player can take.
3. Every step provides a reward.
Q-Learning for Nim - Design & Implementation
We build a Nim engine for three heaps (following Erik Järleberg); a sketch of the data structures appears below.
State
• The set of actions that are available
• A Boolean variable that indicates whether this is a goal state (true for (0,0,0))
• A function that computes the XOR of the heap sizes
• A function that performs an action and thus jumps to another state
Actions
• A number that indicates the heap on which we perform the action
• A number that indicates the amount of items removed
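One possible shape for the state and action objects described above; a sketch in the spirit of the original engine, not Järleberg’s actual code.

```python
from dataclasses import dataclass
from functools import reduce
from operator import xor

@dataclass(frozen=True)
class NimAction:
    heap: int       # which heap we act on
    amount: int     # how many items we remove

@dataclass(frozen=True)
class NimState:
    heaps: tuple    # e.g. (3, 4, 5)

    def is_goal(self):
        return all(h == 0 for h in self.heaps)      # true for (0, 0, 0)

    def xor_value(self):
        return reduce(xor, self.heaps, 0)

    def actions(self):
        return [NimAction(i, k) for i, h in enumerate(self.heaps) for k in range(1, h + 1)]

    def apply(self, action):
        heaps = list(self.heaps)
        heaps[action.heap] -= action.amount
        return NimState(tuple(heaps))
```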
Q-Learning for Nim - Design & Implementation
Agent
Agent Types
1. Random Agent - simply gets the state and randomly picks an action
2. Optimal Agent - the one that plays the optimal (XOR) strategy
3. Q-learning Agent - an agent that learns to play through a Q function
Agent Implementation (see the sketch below)
1. A function that, given a state, recommends an action for the agent
2. A function that provides feedback
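A sketch of the three agent types, reusing NimState/NimAction and nim_winning_move from the sketches above. The method names recommend/feedback are our hypothetical names for the two functions listed.

```python
import random

class RandomAgent:
    def recommend(self, state):
        return random.choice(state.actions())
    def feedback(self, state, action, reward, next_state):
        pass                                        # a random player never learns

class OptimalAgent:
    def recommend(self, state):
        move = nim_winning_move(state.heaps)        # XOR strategy from the sketch above
        if move is None:                            # losing position: any move will do
            return random.choice(state.actions())
        heap, new_size = move
        return NimAction(heap, state.heaps[heap] - new_size)
    def feedback(self, state, action, reward, next_state):
        pass

class QAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q, self.alpha, self.gamma, self.epsilon = {}, alpha, gamma, epsilon
    def recommend(self, state):
        acts = state.actions()
        if random.random() < self.epsilon:
            return random.choice(acts)
        return max(acts, key=lambda a: self.Q.get((state, a), 0.0))
    def feedback(self, state, action, reward, next_state):
        nxt = max((self.Q.get((next_state, a), 0.0) for a in next_state.actions()), default=0.0)
        old = self.Q.get((state, action), 0.0)
        self.Q[(state, action)] = old + self.alpha * (reward + self.gamma * nxt - old)
```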
Q-Learning for Nim - Design & Implementation (cont.)
• What is our objective?
1. We wish to learn as fast as possible
2. We wish to have an accurate Q function
Remarks
1. The learning process is sensitive to the opponent.
2. We may obtain an optimal policy without having an accurate Q function.
Q-Learning for Nim
• Loss function:
We learn a policy π, and for each game state s we have:
U_π(s) = Σ_{i=0}^{∞} γ^i · R(S_i)
• Since the game is solved, we already have the true utility U* for each s, and we evaluate the learning by minimizing:
|U_π − U*| (i.e. min L1).
How to Estimate?
1. L1 - simply minimize |U_π − U*|.
2. For every N-position there is essentially one “right” (winning) action, so instead of the utility function we can measure the number of failures: we count the wrong actions for each s ∈ N (this addresses remark 2). A sketch follows below.
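Method 2, sketched: for each N-position, check whether the agent’s greedy action actually moves to a P-position (XOR = 0). This reuses the NimState and QAgent sketches above and is an illustration of the counting idea, not the original evaluation code.

```python
def count_wrong_actions(q_agent, states):
    """Count the N-positions on which the greedy action of `q_agent`
    fails to reach a P-position (XOR == 0)."""
    wrong = 0
    for state in states:
        if state.xor_value() == 0:
            continue        # P-position: there is no winning move to get right
        greedy = max(state.actions(), key=lambda a: q_agent.Q.get((state, a), 0.0))
        if state.apply(greedy).xor_value() != 0:
            wrong += 1
    return wrong
```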
Method 2
Scenarios
1. Q-agent vs. the optimal player
2. Q-agent vs. another Q-agent
3. Q-agent vs. a random player
Results (plots not reproduced here): Q vs. Optimal, Q vs. Q, Q vs. Random
The Q Function
Remarks
1. The most important factor is the opponent.
2. The Q function assigns positive values to states in N, i.e. learning to play is equivalent to classifying states into N and P.
Corollary: we can apply Q-learning to any impartial game.
3. An interesting next step would be to use actor-critic methods and compare policies through KL-divergence.
Good Sources
• https://github.com/jamesbroadhead/Axelrod/blob/master/axelrod/strategies/qlearner.py
• https://github.com/NimQ/NimRL
• http://www.csc.kth.se/utbildning/kth/kurser/ - RL for Nim, Erik Järleberg
• https://papers.nips.cc/paper/2171-reinforcement-learning-to-play-an-optimal-nash-equilibrium-in-team-markov-games.pdf
• https://www.researchgate.net/profile/J_Enrique_Agudo/publication/221174538_Reinforcement_Learning_for_the_N-Persons_Iterated_Prisoners'_Dilemma/links/5535ec0c0cf268fd0015f0ac/Reinforcement-Learning-for-the-N-Persons-Iterated-Prisoners-Dilemma.pdf
• Gabriel Nivasch - Ariel University
• Sutton & Barto
End!!!!