This is the project presentation for the Adaptive Filter Theory course I took this winter.

1. Adaptive Learning in Games
   Suvarup Saha
   EECS 463 Course Project, 3/11/2010
2. Outline
   - Motivation
   - Games
   - Learning in Games
   - Adaptive Learning
   - Example
   - Gradient Techniques
   - Conclusion
3. Motivation
   - Adaptive filtering techniques generalize to many applications beyond
     filtering: gradient-based iterative search, stochastic gradient,
     least squares
   - Applying game theory to less-than-rational multi-agent scenarios demands
     self-learning mechanisms
   - Adaptive techniques can be applied in such instances to help the agents
     learn the game and play intelligently
4. Games
   - A game is an interaction between two or more self-interested agents
   - Each agent chooses a strategy si from a set of strategies Si
   - A (joint) strategy profile s is the set of chosen strategies, also called
     an outcome of the game in a single play
   - Each agent has a utility function ui(s) specifying its preference for each
     outcome in terms of a payoff
   - An agent's best response is the strategy with the highest payoff, given its
     opponents' choice of strategies
   - A Nash equilibrium is a strategy profile in which every agent's strategy is
     a best response to the others' choices
5. A Normal Form Game

             B
              b1    b2
     A  a1   4,4   5,2
        a2   0,1   4,3

   - This is a 2-player game with SA = {a1, a2}, SB = {b1, b2}
   - The ui(s) are given explicitly in matrix form; for example,
     uA(a1, b2) = 5, uB(a1, b2) = 2
   - The best response of A to B playing b2 is a1
   - In this game, (a1, b1) is the unique Nash equilibrium
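These definitions can be checked mechanically. Below is a minimal Python sketch (my own illustration, not part of the slides; all names are hypothetical) that brute-forces the pure-strategy Nash equilibria of this 2x2 game:

```python
# Payoffs of the slide's game, indexed by (A's strategy, B's strategy)
u_A = {("a1", "b1"): 4, ("a1", "b2"): 5, ("a2", "b1"): 0, ("a2", "b2"): 4}
u_B = {("a1", "b1"): 4, ("a1", "b2"): 2, ("a2", "b1"): 1, ("a2", "b2"): 3}
S_A, S_B = ["a1", "a2"], ["b1", "b2"]

def best_response_A(b):
    # A's best response to B playing b: maximize u_A over A's strategies
    return max(S_A, key=lambda a: u_A[(a, b)])

def best_response_B(a):
    # B's best response to A playing a
    return max(S_B, key=lambda b: u_B[(a, b)])

# A profile is a Nash equilibrium when each strategy is a best response
# to the other player's strategy
nash = [(a, b) for a in S_A for b in S_B
        if a == best_response_A(b) and b == best_response_B(a)]
print(nash)  # [('a1', 'b1')]
```

The search confirms the slide's claim that (a1, b1) is the unique pure-strategy Nash equilibrium.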
6. Learning in Games
   - Classical approach: compute an optimal/equilibrium strategy
   - Some criticisms of this approach:
     - Other agents' utilities might be unknown to an agent, preventing it from
       computing an equilibrium strategy
     - Other agents might not be playing an equilibrium strategy
     - Computing an equilibrium strategy might be hard
   - Another approach: learn how to 'optimally' play a game by playing it many
     times and updating the strategy based on experience
7. Learning Dynamics
   - Ordered by increasing rationality/sophistication of the agents:
     Evolutionary Dynamics -> Adaptive Learning -> Bayesian Learning
   - Adaptive learning is the focus of our discussion
8. Evolutionary Dynamics
   - Inspired by evolutionary biology, with no appeal to the rationality of
     the agents
   - An entire population of agents, all programmed to use some strategy
   - Players are randomly matched to play with each other
   - Strategies with high payoff spread through the population by learning,
     i.e. copying or inheriting strategies (replicator dynamics), or by
     infection
   - Stability analysis: evolutionarily stable strategies (ESS); players
     playing an ESS must have strictly higher payoffs than a small group of
     invaders playing a different strategy
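Replicator dynamics can be sketched in a few lines. The payoff matrix below is a hypothetical Hawk-Dove game chosen for illustration (it is not from the slides); the population share of Hawks grows when Hawk fitness exceeds the population average, and converges to the mixed ESS:

```python
# Hypothetical symmetric 2-strategy game (Hawk-Dove with V=2, C=4):
# row strategy's payoff against (Hawk, Dove)
A = [[-1.0, 2.0],   # Hawk vs (Hawk, Dove)
     [ 0.0, 1.0]]   # Dove vs (Hawk, Dove)

def step(x, dt=0.01):
    # x is the population share playing Hawk
    f_hawk = A[0][0] * x + A[0][1] * (1 - x)   # fitness of Hawk
    f_dove = A[1][0] * x + A[1][1] * (1 - x)   # fitness of Dove
    f_avg = x * f_hawk + (1 - x) * f_dove      # population average fitness
    return x + dt * x * (f_hawk - f_avg)       # replicator update (Euler step)

x = 0.1
for _ in range(10000):
    x = step(x)
print(round(x, 3))  # converges toward the mixed ESS x* = 0.5
```

For this matrix, Hawk and Dove fitness are equal at x = 0.5, so the share of Hawks settles there from any interior starting point.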
9. Bayesian Learning
   - Assumes 'informed agents' playing repeated games with a finite action space
   - Payoffs depend on some characteristics of agents represented by types;
     each agent's type is private information
   - The agents' initial beliefs are given by a common prior distribution over
     agent types
   - This belief is updated according to Bayes' rule to a posterior distribution
     at each stage of the game
   - Every finite Bayesian game has at least one Bayesian Nash equilibrium,
     possibly in mixed strategies
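The Bayes'-rule belief update can be illustrated with a toy example. The two types and their action probabilities below are my own assumptions, not from the slides; after each observed action, the belief over the opponent's type is updated to the posterior:

```python
# Hypothetical setup: two opponent types under a common 50/50 prior.
# Each type plays action "a" with a different (assumed known) probability.
prior = {"type1": 0.5, "type2": 0.5}
p_a = {"type1": 0.8, "type2": 0.3}   # P(action "a" | type)

def update(belief, action):
    # posterior(type) is proportional to P(action | type) * belief(type)
    like = {t: (p_a[t] if action == "a" else 1 - p_a[t]) for t in belief}
    z = sum(like[t] * belief[t] for t in belief)   # normalizing constant
    return {t: like[t] * belief[t] / z for t in belief}

belief = prior
for obs in ["a", "a", "b"]:            # observed stage-game actions
    belief = update(belief, obs)
print({t: round(p, 3) for t, p in belief.items()})
```

After observing "a" twice and "b" once, the posterior favors type1, whose behavior better explains the history.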
10. Adaptive Learning
   - Agents are not fully rational, but can learn through experience and adapt
     their strategies
   - Agents do not know the reward structure of the game
   - Agents are only able to take actions and observe their own rewards (or
     opponents' rewards as well)
   - Popular examples: best response update, fictitious play, regret matching,
     infinitesimal gradient ascent (IGA), dynamic gradient play, adaptive play,
     Q-learning
11. Fictitious Play
   - The learning process develops a 'historical distribution' of the other
     agents' play
   - In fictitious play, agent i has an exogenous initial weight function
     ki0 : S-i -> R+
   - The weight is updated by adding 1 to the weight of each opponent strategy,
     each time it is played
   - The probability that player i assigns to player -i playing s-i at date t is
       qit(s-i) = kit(s-i) / Σs'-i kit(s'-i)
   - Agent i's 'best response' in fictitious play is
       sit+1 = arg maxsi Σs-i qit(s-i) ui(si, s-i)
12. An Example
   - Consider the same 2x2 game example as before:

             B
              b1    b2
     A  a1   4,4   5,2
        a2   0,1   4,3

   - Suppose we assign kA0(b1) = kA0(b2) = kB0(a1) = kB0(a2) = 1
   - Then qA0(b1) = qA0(b2) = qB0(a1) = qB0(a2) = 0.5
   - For A, choosing a1 gives
       qA0(b1)uA(a1,b1) + qA0(b2)uA(a1,b2) = 0.5*4 + 0.5*5 = 4.5,
     while choosing a2 gives
       qA0(b1)uA(a2,b1) + qA0(b2)uA(a2,b2) = 0.5*0 + 0.5*4 = 2
   - For B, choosing b1 gives
       qB0(a1)uB(a1,b1) + qB0(a2)uB(a2,b1) = 0.5*4 + 0.5*1 = 2.5,
     while choosing b2 gives
       qB0(a1)uB(a1,b2) + qB0(a2)uB(a2,b2) = 0.5*2 + 0.5*3 = 2.5
   - Clearly A plays a1; B can choose either b1 or b2; assume B plays b2
13. Game proceeds.

   stage               0
   A's selection       a1
   B's selection       b2
   A's payoff          5
   B's payoff          2

   Weights and beliefs, before and after stage 0:
   kAt(b1), qAt(b1)    1, 0.5   1, 0.33
   kAt(b2), qAt(b2)    1, 0.5   2, 0.67
   kBt(a1), qBt(a1)    1, 0.5   2, 0.67
   kBt(a2), qBt(a2)    1, 0.5   1, 0.33
14. Game proceeds..

   stage               0        1
   A's selection       a1       a1
   B's selection       b2       b1
   A's payoff          5        4
   B's payoff          2        4

   Weights and beliefs, before stage 0 and after each stage:
   kAt(b1), qAt(b1)    1, 0.5   1, 0.33  2, 0.5
   kAt(b2), qAt(b2)    1, 0.5   2, 0.67  2, 0.5
   kBt(a1), qBt(a1)    1, 0.5   2, 0.67  3, 0.75
   kBt(a2), qBt(a2)    1, 0.5   1, 0.33  1, 0.25
15. Game proceeds...

   stage               0        1        2
   A's selection       a1       a1       a1
   B's selection       b2       b1       b1
   A's payoff          5        4        4
   B's payoff          2        4        4

   Weights and beliefs, before stage 0 and after each stage:
   kAt(b1), qAt(b1)    1, 0.5   1, 0.33  2, 0.5   3, 0.6
   kAt(b2), qAt(b2)    1, 0.5   2, 0.67  2, 0.5   2, 0.4
   kBt(a1), qBt(a1)    1, 0.5   2, 0.67  3, 0.75  4, 0.8
   kBt(a2), qBt(a2)    1, 0.5   1, 0.33  1, 0.25  1, 0.2
16. Game proceeds....

   stage               0        1        2        3
   A's selection       a1       a1       a1       a1
   B's selection       b2       b1       b1       b1
   A's payoff          5        4        4        4
   B's payoff          2        4        4        4

   Weights and beliefs, before stage 0 and after each stage:
   kAt(b1), qAt(b1)    1, 0.5   1, 0.33  2, 0.5   3, 0.6   4, 0.67
   kAt(b2), qAt(b2)    1, 0.5   2, 0.67  2, 0.5   2, 0.4   2, 0.33
   kBt(a1), qBt(a1)    1, 0.5   2, 0.67  3, 0.75  4, 0.8   5, 0.83
   kBt(a2), qBt(a2)    1, 0.5   1, 0.33  1, 0.25  1, 0.2   1, 0.17
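The fictitious-play trace above can be reproduced in a few lines of Python. This is a sketch of the slides' example (all helper names are my own); the stage-0 tie-break, where B prefers b2, is hard-coded to match the table:

```python
# Payoffs of the 2x2 game, indexed by (A's strategy, B's strategy)
u_A = {("a1", "b1"): 4, ("a1", "b2"): 5, ("a2", "b1"): 0, ("a2", "b2"): 4}
u_B = {("a1", "b1"): 4, ("a1", "b2"): 2, ("a2", "b1"): 1, ("a2", "b2"): 3}
k_A = {"b1": 1.0, "b2": 1.0}   # A's weights over B's strategies
k_B = {"a1": 1.0, "a2": 1.0}   # B's weights over A's strategies

def best_response(weights, utility, own, tie_order):
    # empirical beliefs q over the opponent's strategies
    q = {s: w / sum(weights.values()) for s, w in weights.items()}
    # expected payoff of each own strategy against the belief q
    ev = {s: sum(q[o] * utility[(s, o) if own == "A" else (o, s)] for o in q)
          for s in tie_order}
    best = max(ev.values())
    return next(s for s in tie_order if ev[s] == best)  # deterministic tie-break

history = []
for stage in range(4):
    a = best_response(k_A, u_A, "A", ["a1", "a2"])
    b = best_response(k_B, u_B, "B", ["b2", "b1"])  # b2 first: slide's tie-break
    history.append((a, b))
    k_A[b] += 1   # each player adds 1 to the weight of the observed strategy
    k_B[a] += 1
print(history)  # [('a1', 'b2'), ('a1', 'b1'), ('a1', 'b1'), ('a1', 'b1')]
```

The play locks onto (a1, b1) from stage 1 onward, matching the table and the game's unique Nash equilibrium.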
17. Gradient-Based Learning
   - Fictitious play assumes unbounded computation is allowed at every step
     (the arg max calculation)
   - An alternative is to proceed by gradient ascent on some objective
     function: the expected payoff
   - The two players, row and column, have payoff matrices

       R = | r11  r12 |        C = | c11  c12 |
           | r21  r22 |            | c21  c22 |

   - The row player chooses action 1 with probability α, while the column
     player chooses action 1 with probability β
   - Expected payoffs are
       Vr(α, β) = r11 αβ + r12 α(1-β) + r21 (1-α)β + r22 (1-α)(1-β)
       Vc(α, β) = c11 αβ + c12 α(1-β) + c21 (1-α)β + c22 (1-α)(1-β)
18. Gradient Ascent
   - Each player repeatedly adjusts her half of the current strategy pair in
     the direction of the current gradient, with some step size η:
       αk+1 = αk + η ∂Vr(αk, βk)/∂α
       βk+1 = βk + η ∂Vc(αk, βk)/∂β
   - If an update takes a strategy outside the probability simplex, it is
     projected back to the boundary
   - The gradient ascent algorithm assumes a full-information game: both
     players know the game matrices and can see the opponent's mixed strategy
     from the previous step
   - Defining u = (r11 + r22) - (r21 + r12) and u' = (c11 + c22) - (c21 + c12),
     the gradients are
       ∂Vr(α, β)/∂α = βu - (r22 - r12)
       ∂Vc(α, β)/∂β = αu' - (c22 - c21)
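A minimal sketch of this projected gradient-ascent update, applied to the 2x2 game from slide 5 (alpha and beta are the probabilities of a1 and b1; the loop length and step size are my own choices):

```python
# Payoff entries of the slides' example game (slide 5)
r11, r12, r21, r22 = 4.0, 5.0, 0.0, 4.0
c11, c12, c21, c22 = 4.0, 2.0, 1.0, 3.0
u = (r11 + r22) - (r21 + r12)     # = 3
u_p = (c11 + c22) - (c21 + c12)   # = 4

def clip(p):
    # project back onto the probability simplex [0, 1]
    return min(1.0, max(0.0, p))

alpha, beta, eta = 0.5, 0.5, 0.01
for _ in range(2000):
    d_alpha = beta * u - (r22 - r12)     # dVr/d(alpha)
    d_beta = alpha * u_p - (c22 - c21)   # dVc/d(beta)
    alpha = clip(alpha + eta * d_alpha)
    beta = clip(beta + eta * d_beta)
print(alpha, beta)  # 1.0 1.0
```

Both probabilities are driven to 1, i.e. the pure profile (a1, b1), which is this game's Nash equilibrium.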
19. Infinitesimal Gradient Ascent
   - Interesting to see what happens to the strategy pair and to the expected
     payoffs over time
   - The strategy-pair sequence produced by following a gradient ascent
     algorithm may never converge
   - The average payoff of both players, however, always converges to that of
     some Nash pair
   - Under a small-step-size assumption (η -> 0), the update equations become

       | ∂α/∂t |   | 0   u  | | α |   | -(r22 - r12) |
       |       | = |        | |   | + |              |
       | ∂β/∂t |   | u'  0  | | β |   | -(c22 - c21) |

   - The point where the gradient is zero is a Nash equilibrium:
       (α*, β*) = ( (c22 - c21)/u' , (r22 - r12)/u )
   - This point might even lie outside the probability simplex
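For the slides' example game, the zero-gradient point can be computed directly, and it indeed falls outside the probability simplex (a quick check of my own, using the slide-5 payoffs):

```python
# alpha* = (c22 - c21)/u',  beta* = (r22 - r12)/u for the slides' example game
r12, r22 = 5.0, 4.0
c21, c22 = 1.0, 3.0
u = (4.0 + 4.0) - (0.0 + 5.0)    # (r11 + r22) - (r21 + r12) = 3
u_p = (4.0 + 3.0) - (1.0 + 2.0)  # (c11 + c22) - (c21 + c12) = 4
alpha_star = (c22 - c21) / u_p   # 0.5
beta_star = (r22 - r12) / u      # -1/3: outside [0, 1], as the slide warns
print(alpha_star, beta_star)
```

Since beta* is negative, the continuous-time dynamics for this game never settle at the zero-gradient point inside the simplex; the projection step keeps the trajectory on the boundary instead.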
20. IGA Dynamics
   - Denote by U the off-diagonal matrix containing u and u'
   - Depending on the nature of U (non-invertible, real or imaginary
     eigenvalues), the convergence dynamics will vary
21. WoLF: W(in)-o(r)-L(earn)-Fast
   - Introduces a variable learning rate instead of a fixed η:
       αk+1 = αk + η lrk ∂Vr(αk, βk)/∂α
       βk+1 = βk + η lck ∂Vc(αk, βk)/∂β
   - Let αe be the equilibrium strategy selected by the row player and βe the
     equilibrium strategy selected by the column player:
       lrk = lmin if Vr(αk, βk) > Vr(αe, βk)   (winning)
             lmax otherwise                    (losing)
       lck = lmin if Vc(αk, βk) > Vc(αk, βe)   (winning)
             lmax otherwise                    (losing)
   - If, in a two-person, two-action, iterated general-sum game, both players
     follow the WoLF-IGA algorithm (with lmax > lmin), then their strategies
     converge to a Nash equilibrium
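A sketch of WoLF-IGA on matching pennies (a standard zero-sum example of my own choosing, not from the slides), where plain IGA cycles around the mixed equilibrium but the variable learning rate spirals in to (0.5, 0.5). Here the equilibrium strategies αe = βe = 0.5 are assumed known, as the update rule requires:

```python
# Matching pennies: R = [[1, -1], [-1, 1]], C = -R, so u = 4 and u' = -4.
eta, l_min, l_max = 0.01, 1.0, 2.0
alpha, beta = 0.7, 0.3            # initial mixed strategies (prob. of action 1)

def V_r(a, b):
    # row player's expected payoff simplifies to (2a - 1)(2b - 1)
    return (2 * a - 1) * (2 * b - 1)

def V_c(a, b):
    # column player's payoff is the negative (zero-sum game)
    return -V_r(a, b)

clip = lambda p: min(1.0, max(0.0, p))
for _ in range(100000):
    # winning -> learn slowly (l_min); losing -> learn fast (l_max)
    l_r = l_min if V_r(alpha, beta) > V_r(0.5, beta) else l_max
    l_c = l_min if V_c(alpha, beta) > V_c(alpha, 0.5) else l_max
    d_alpha = 4 * beta - 2        # dVr/d(alpha) for this game
    d_beta = -4 * alpha + 2       # dVc/d(beta)
    alpha = clip(alpha + eta * l_r * d_alpha)
    beta = clip(beta + eta * l_c * d_beta)
print(round(alpha, 2), round(beta, 2))  # 0.5 0.5
```

With l_min = l_max the same loop orbits the equilibrium forever; the l ratio is what shrinks each loop of the spiral, as the WoLF-IGA theorem on this slide predicts.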
22. WoLF-IGA Convergence
   (figure: convergence trajectories; not reproduced in this transcript)
23. To Conclude
   - Learning in games is popular in anticipation of a future in which
     less-than-rational agents play a game repeatedly to arrive at a stable and
     efficient equilibrium
   - The algorithmic structure and adaptive techniques involved in such
     learning are largely motivated by machine learning and adaptive filtering
   - Fictitious play carries a heavy computational burden; a gradient-based
     approach relieves it, but might suffer from convergence issues
   - A stochastic gradient method (not discussed in the presentation) makes use
     of the minimal information available and still performs near-optimally