3. Reinforcement learning

3.1 The Q-learning algorithm

If we move from the strategy regime of full rationality to a regime of exploring rationality, where players are willing to take some risk to explore their opponents and the game structure, there is more than one way to model the players. Here we are drawn to the class of reinforcement learning models, which seem to describe human behavior better than full rationality does (see Roth and Erev 1995). Reinforcement learning is inspired by learning theory in psychology, which holds that the likelihood of choosing an action is strengthened if that action leads to a favorable outcome. This characteristic has been observed to be quite robust in the learning behavior of humans and animals. At the same time, reinforcement learning has been widely used in the machine learning community to solve individual decision problems (see Sutton and Barto 1998), but it has rarely been studied in a strategic environment (such as repeated games), where the learning behavior of each player has an important impact on the others and on the outcome. So it is also interesting, from a machine learning perspective, to see how reinforcement learning performs in such a strategic environment.

In the following we describe the specific reinforcement learning algorithm used in our experiments: Q-learning. More detailed information can be found in Watkins (1989), Watkins and Dayan (1992), and Sutton and Barto (1998).

The Q-learning algorithm works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The standard procedure for Q-learning is as follows. Assume that Q(s,a) is represented by a lookup table containing a value for every possible state-action pair, and that the table entries are initialized to arbitrary values. Then the procedure for estimating the correct Q(s,a) is to repeat the following loop until a termination criterion is met:

1. In the current state s, choose an action a; this yields an immediate reward r and a transition to the next state s'.
2. Update Q(s,a) according to

$$\Delta Q(s,a) = \alpha \left[\, r + \gamma \max_{b} Q(s',b) - Q(s,a) \,\right] \qquad (1)$$

where α is the learning rate parameter and γ is the discount factor.

In the context of repeated games, the player explores its environment (its opponent and the game structure) by taking some risk in step 1 to choose an action that may not be the currently optimal one. In step 2, an action that leads to a higher reward strengthens the Q value for that state-action pair. The above procedure is guaranteed to converge to the correct Q values for stationary Markov decision processes (MDPs). In practice, the exploration strategy in step 1 is usually chosen so that it ensures sufficient exploration while still favoring actions with higher value estimates in a given state. A variety of methods may be used.
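As a concrete illustration of the update rule in equation (1), the following is a minimal sketch of a tabular Q-learning agent in Python; the class and method names and the default parameter values are our own illustrative choices, not code from the original study.

```python
import random
from collections import defaultdict

class TabularQLearner:
    """Minimal tabular Q-learning agent (illustrative sketch)."""

    def __init__(self, actions, alpha=0.2, gamma=0.95):
        self.actions = list(actions)      # actions available in every state
        self.alpha = alpha                # learning rate
        self.gamma = gamma                # discount factor
        self.q = defaultdict(float)       # Q(s, a) lookup table, initialized to 0.0

    def best_action(self, state):
        """Greedy action: argmax_a Q(state, a), ties broken at random."""
        best = max(self.q[(state, a)] for a in self.actions)
        return random.choice([a for a in self.actions
                              if self.q[(state, a)] == best])

    def update(self, state, action, reward, next_state):
        """Equation (1): Q(s,a) += alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]."""
        target = reward + self.gamma * max(self.q[(next_state, b)]
                                           for b in self.actions)
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```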
A simple one is to behave greedily most of the time but, with a small probability ε, choose an action at random from among those that do not have the highest Q value. This action selection method is called ε-greedy (Sutton and Barto 1998). Another is the softmax action selection method, in which an action with a higher value estimate is more likely to be chosen in a given state. The most common form for the probability of choosing action a is

$$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}} \qquad (2)$$

where τ is a positive parameter (the temperature) that decreases over time. In the limit τ → 0, softmax action selection becomes greedy action selection. In our experiments we tried both ε-greedy and softmax action selection.

3.2 Implementation of Q-learning for 2 by 2 games

Q-learning does not need a model of its environment and can be used on-line; it is therefore well suited to repeated games against an unknown opponent (especially one with the same adaptive behavior). Here we focus on repeated 2 by 2 games. Considering the iterated prisoner's dilemma, where "tit for tat" is often discussed, it is natural to represent the state as the outcome of the previous play of the game; in this case we say the player has a memory length of one. The number of states for a 2 by 2 game is then 4, and in each state there are two actions (the pure strategies) from which the player can choose in the current game. We also conducted experiments in which the players have a memory length of two (so the number of states is 16). The immediate reward a player receives is its payoff in the payoff matrix.

For the softmax action selection method, we let the parameter τ decrease as

$$\tau = T\,\theta^{\,n} \qquad (3)$$

where T is a constant, n is the number of games played so far, and θ, called the annealing factor, is a positive constant less than one. In the implementation, once n gets large enough, τ is close to zero and the player stops exploring; we switch to ε-greedy selection after that point to keep the player exploring.

4. Experiments

4.1 The motivation

Repeated 2 by 2 games are the simplest setting for strategic interaction and are a good starting point for investigating how different the outcome can be under exploring rationality compared with full rationality. Take the iterated prisoner's dilemma as an example: if a player takes the risk of cooperating in some round, hoping to induce later cooperation from an opponent who may be reasoning the same way, it is possible that both players eventually find that they can get more through mutual cooperation. Even if a player loses a little in the early stage, the prospect of sustained mutual cooperation later is a good reason to explore early on.
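To make the action selection scheme of Sections 3.1 and 3.2 concrete before turning to the specific games, the sketch below implements softmax selection with the annealing schedule of equation (3) and the fall-back to ε-greedy selection once the temperature is small. The function names and the default parameter values (which anticipate the settings listed in Section 4.3) are our own illustrative choices rather than code from the original study.

```python
import math
import random

def softmax_action(q_values, tau):
    """Choose an action with probability proportional to exp(Q/tau), as in equation (2)."""
    actions = list(q_values)
    max_q = max(q_values.values())
    # Subtracting max_q avoids overflow at low temperatures; probabilities are unchanged.
    weights = [math.exp((q_values[a] - max_q) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon pick a non-greedy action at random, otherwise act greedily."""
    actions = list(q_values)
    best = max(q_values.values())
    greedy = [a for a in actions if q_values[a] == best]
    others = [a for a in actions if q_values[a] < best]
    if others and random.random() < epsilon:
        return random.choice(others)
    return random.choice(greedy)

def select_action(q_values, n, T=5.0, theta=0.9999, epsilon=0.01, tau_min=0.01):
    """Softmax with annealed temperature tau = T * theta**n, equation (3);
    switch to epsilon-greedy once tau has fallen below tau_min."""
    tau = T * theta ** n
    if tau < tau_min:
        return epsilon_greedy_action(q_values, epsilon)
    return softmax_action(q_values, tau)
```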
Motivated by this intuition, we deliberately selected eight games and parameterized their payoff matrices. The players are modeled as Q-learners in each repeated game. For five of the games, the Pareto optimal solution does not coincide with a Nash equilibrium. The rest are games with two Nash equilibria, which we included to address the equilibrium selection issue.

4.2 The games and the parameterization

The following is the list of games and how we parameterized their payoff matrices. The parameter is δ. In each payoff matrix, the first number is the payoff of the row player and the second is the payoff of the column player. We mark a Nash equilibrium (NE) with # and the Pareto optimal solution with *. C and D are the actions, or pure strategies, that the players can take, and the row player always comes first: when we say the outcome of one play is CD, the row player chose pure strategy C and the column player chose pure strategy D. So there are four possible outcomes of one play: CC, CD, DC, and DD.

The first two games are versions of the prisoner's dilemma. The value of δ ranges from 0 to 3; when δ = 2 in Table 4.1, the matrix corresponds to the most common payoff matrix for the prisoner's dilemma.

      C            D
C   (3,3)*       (0, 3+δ)
D   (3+δ, 0)     (3-δ, 3-δ)#
Table 4.1: Prisoner's dilemma Pattern 1

      C            D
C   (3,3)*       (0, 3+δ)
D   (3+δ, 0)     (δ, δ)#
Table 4.2: Prisoner's dilemma Pattern 2

While the above two are symmetric games, the following three are asymmetric games adapted from Rapoport and Guyer (Rapoport, Guyer, and Gordon 1976), although the payoff numbers are not the same as in the original games. The value of δ is again taken from 0 to 3.

      C               D
C   (0.2, 0.3)#     (0.3+δ, 0.1)
D   (0.1, 0.2)      (0.2+δ, 0.3+δ)*
Table 4.3: Game #47

      C               D
C   (0.2, 0.2)#     (0.3+δ, 0.1)
D   (0.1, 0.3)      (0.2+δ, 0.3+δ)*
Table 4.4: Game #48

      C               D
C   (0.2, 0.3)#     (0.3+δ, 0.2)
D   (0.1, 0.1)      (0.2+δ, 0.3+δ)*
Table 4.5: Game #57

For games with two Nash equilibria, one challenging question is which equilibrium is more likely to be selected as the outcome. We chose three games from this class. In Stag Hunt, the Pareto optimal solution is also one of the Nash equilibria; the game of Chicken and Battle of the Sexes are coordination games. The value of δ is taken from 0 to 3 for Stag Hunt and Battle of the Sexes, and from 0 to 2 for Chicken. Note that no Pareto optimal solution is marked for the last two coordination games.

      C          D
C   (5,5)*     (0,3)
D   (3,0)      (δ,δ)
Table 4.6: Stag Hunt

      C             D
C   (δ, 3-δ)#     (0,0)
D   (0,0)         (3-δ, δ)#
Table 4.7: Battle of the Sexes

      C             D
C   (2,2)         (δ, 2+δ)#
D   (2+δ, δ)#     (0,0)
Table 4.8: Chicken

4.3 The setting for the experiments

The parameters for Q-learning are set as follows. The learning rate is 0.2 and the discount factor is 0.95. We ran the experiments with both softmax and ε-greedy action selection. For softmax action selection, T is set to 5 and the annealing factor to 0.9999; when τ drops below 0.01 we switch to ε-greedy selection, with ε set to 0.01. These parameter values follow those used in other studies (Sandholm and Crites 1995).

Each repeated game runs for 200,000 iterations, so the players are given enough time to explore and learn. For each setting of the payoff parameter δ, we ran the repeated game for 100 trials and recorded the frequencies of the four outcomes (CC, CD, DC, and DD) every 100 iterations. The numbers usually become stable within 50,000 iterations, so unless noted otherwise we report the frequencies of the outcomes in the last 100 iterations, aggregated over the 100 trials.

The tables in the appendix share a similar layout: the middle column is the payoff parameter δ, the results for ε-greedy action selection are on its left, and the results for softmax action selection are on its right. The numbers are the frequencies of the four outcomes (CC, CD, DC, and DD) in the last 100 iterations over the 100 runs.
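To show how these settings fit together, here is an illustrative self-play loop for Prisoner's Dilemma Pattern 1 (Table 4.1) with memory length one, reusing the hypothetical TabularQLearner and select_action sketches given earlier; the helper names pd_pattern1_payoffs and run_trial are likewise our own, not the original experiment code.

```python
import random
from collections import Counter

def pd_pattern1_payoffs(delta):
    """Payoff matrix of Table 4.1 as (row, column) payoffs keyed by the joint outcome."""
    return {('C', 'C'): (3, 3),
            ('C', 'D'): (0, 3 + delta),
            ('D', 'C'): (3 + delta, 0),
            ('D', 'D'): (3 - delta, 3 - delta)}

def run_trial(delta, iterations=200_000, seed=None):
    """One trial: two Q-learners, state = previous joint outcome (memory length one)."""
    if seed is not None:
        random.seed(seed)
    payoffs = pd_pattern1_payoffs(delta)
    row = TabularQLearner(actions=('C', 'D'), alpha=0.2, gamma=0.95)
    col = TabularQLearner(actions=('C', 'D'), alpha=0.2, gamma=0.95)
    state = ('C', 'C')                      # arbitrary initial state
    counts = Counter()
    for n in range(iterations):
        a_row = select_action({a: row.q[(state, a)] for a in row.actions}, n)
        a_col = select_action({a: col.q[(state, a)] for a in col.actions}, n)
        outcome = (a_row, a_col)
        r_row, r_col = payoffs[outcome]
        row.update(state, a_row, r_row, outcome)
        col.update(state, a_col, r_col, outcome)
        state = outcome
        if n >= iterations - 100:           # frequencies over the last 100 iterations
            counts[outcome] += 1
    return counts

# Example usage: counts = run_trial(delta=2.0)
```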
4.4 The results

It is disturbing that classical game theory sometimes tells you that only the inferior Nash equilibrium will be the outcome, and not the Pareto optimal solution (which is not necessarily a Nash equilibrium). In our first five games, the subgame perfect Nash equilibrium is never the Pareto optimal outcome. So the question is: will the outcome be different if we use models, such as reinforcement learning, that better describe human behavior, and how reasonable is that outcome? More specifically, can the players learn to play a Pareto optimal solution that is not a Nash equilibrium? Our experiments show that the answer is positive.

Consider Table 1 for Prisoner's Dilemma Pattern 1. If δ is close to zero, the two players choose to defect most of the time. The reason is that there is not much difference between the outcomes of mutual defection and mutual cooperation; the Pareto outcome does not provide enough incentive for the players to take the risk of inducing cooperation. To avoid being exploited and receiving a zero payoff, they are better off defecting all the time. But as δ gets larger, more mutual cooperation is observed, which suggests that both players are trying to settle on the Pareto optimal outcome. The last row in Table 1 shows an interesting scenario: each player wants to induce the other's cooperation so that it can take advantage of it by defecting, because the temptation to defect is very large when the other player cooperates. That is why we see many CD and DC outcomes, but less mutual cooperation (CC).

The comparison with Prisoner's Dilemma Pattern 2 in Table 2 is illustrative. In Pattern 2 the players lose almost nothing by trying to cooperate when δ is close to zero; exploration helps the players reach the much superior Pareto outcome (CC), and as Table 2 shows, mutual cooperation happens 94% of the time. When δ is close to 3, however, there is first not much incentive to shift from the Nash equilibrium (DD) to the Pareto outcome (CC), since the payoffs differ little, and second, the danger of being exploited by the other player and receiving a zero payoff is much higher; in the end the players learn to defect most of the time (98%).

Now let us turn to a different class of games. Game #47, game #48, and game #57 are asymmetric games with a common feature: the row player has a dominant strategy C, because this strategy always gives a higher payoff than strategy D no matter what the other player does. Thus a fully rational player would never choose D. What happens if the players are allowed to explore and learn? Tables 3-5 tell us that it depends on the payoffs. If δ is close to zero, the outcome is the Nash equilibrium (CC) almost 97% of the time, since it does not pay to try to induce the other player toward the Pareto optimal solution, and a player who tries is more likely to be exploited. But as long as the incentive from the Pareto outcome is large enough, the Pareto outcome (DD) is observed a considerable fraction of the time (above 94%).

The Stag Hunt game is interesting because its Pareto optimal solution is also one of its Nash equilibria, but which equilibrium is more likely to be sustained remains a challenging problem for classical game theory. A mixed strategy (i.e., choosing each pure strategy with some fixed probability) seems the natural prediction for this repeated game from the standpoint of classical game theory.
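For reference, the mixed equilibrium used as a benchmark below can be derived directly from the Stag Hunt payoffs in Table 4.6 (a short worked example of our own). Let p be the probability that the opponent plays C; a player is indifferent between C and D when the expected payoffs are equal:

$$5p \;=\; 3p + \delta(1-p), \qquad \delta = 1 \;\Rightarrow\; p = \tfrac{1}{3}.$$

In the symmetric mixed equilibrium each player therefore plays C with probability 1/3 and D with probability 2/3, so under independent mixing the outcome CC would occur with probability $(1/3)^2 = 1/9$, about 11% of the time.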
Table 6 shows that the outcomes of this repeated game with reinforcement learning players are quite different from the mixed strategy prediction. For example, when δ equals 1, the mixed strategy for both players is to choose action C with probability 1/3 and D with probability 2/3, so we should expect to see CC less than 33% of the time, while Table 6 shows CC happening 88% of the time. We also see that when the Pareto outcome is far superior to the other Nash equilibrium, it is chosen almost 94% of the time.

The remaining two games are coordination games, and we are concerned not only with which Nash equilibrium is selected but also with a further question: is the Nash equilibrium concept sufficient to describe what happens in these games? The latter concern arises because different behaviors are observed in human experiments. Rapoport et al. (1976) reported that, when playing the game of Chicken, a majority of subjects quickly settled into an alternating strategy, with the outcome switching back and forth between the two Nash equilibria.

From Table 7 we can see that the two Nash equilibria in Battle of the Sexes are about equally likely to be the outcome in most cases, since the game is symmetric and these two outcomes are superior to the other two, which give both players zero payoff. As for the game of Chicken, Table 8 shows that if the incentive for coordinating on a Nash equilibrium is too small (i.e., δ is close to zero), both players learn to be conservative at the same time (CC), since they cannot afford the loss in the DD situation (a zero payoff). As δ increases, the game ends up more and more at a Nash equilibrium (CD or DC).

To see whether the players can learn the alternating strategy observed in human-subject experiments, we ran another 100 trials for these two games with δ set to 1 and with softmax action selection. In most trials the outcome converges to one of the Nash equilibria, but we did observe patterns showing alternating strategies in both games. These patterns are quite stable and recover quickly from small random disturbances. For Battle of the Sexes there is only one such pattern, in which the players play the two Nash equilibria alternately; it appeared in 11 of the 100 trials. For the game of Chicken there are other kinds of patterns, summarized with their frequencies in Table 4.9.

The outcomes | Frequency in 100 trials
Alternating between CD and DC | 10
Cycle through CD-DC-CC or CD-CC-DC | 13
Converge to one of the three: CC, CD or DC | 76
No obvious pattern | 1
Table 4.9: Frequencies of different kinds of outcomes in the game of Chicken

These pattern frequencies cannot be called large, but first, we use payoff numbers different from those of Rapoport et al. (1976), which may influence the incentive to form such a strategy; second, our players do not explicitly know the payoff matrix and can only learn the payoff structure of the opponent implicitly through its behavior, which is not an easy task; and finally, there may be features of human behavior that are important for human subjects to learn such alternating strategies but are not captured in our current Q-learning model. Our main point, however, is clear: the Nash equilibrium concept does not seem sufficient for describing the outcomes of repeated coordination games such as Chicken and Battle of the Sexes.
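The alternating and cycling patterns above were identified from the recorded outcome sequences. Purely as an illustration of how such trajectories could be classified automatically, the sketch below labels the tail of an outcome sequence with categories mirroring Table 4.9; the window length and the classification rules are our own assumptions, not the procedure actually used for these counts.

```python
def classify_tail(outcomes, window=100):
    """Crudely classify the last `window` joint outcomes of a trial.

    `outcomes` is a sequence of strings such as 'CC', 'CD', 'DC', 'DD'.
    Returns 'converged', 'alternating', 'cycling', or 'no obvious pattern'.
    """
    tail = list(outcomes)[-window:]
    if len(set(tail)) == 1:
        return 'converged'
    # Alternating: two outcomes that strictly take turns, e.g. CD, DC, CD, DC, ...
    if len(set(tail)) == 2 and all(tail[i] != tail[i + 1] for i in range(len(tail) - 1)):
        return 'alternating'
    # Cycling: the tail repeats with a short period, e.g. CD-DC-CC-CD-DC-CC-...
    for period in (3, 4):
        if all(tail[i] == tail[i - period] for i in range(period, len(tail))):
            return 'cycling'
    return 'no obvious pattern'
```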
*Note: To save space, there are additional results that we do not discuss here but include in the appendix for completeness.
1. We repeated all the experiments with the players' memory length set to 2. The results are shown in Tables 9-16.
2. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and repeated the experiment with ε-greedy action selection and memory length 1 on game #47; the results are summarized in Table 17.
3. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and repeated the experiment with ε-greedy action selection on the game of Chicken and Battle of the Sexes; the frequencies of the patterns are reported in Tables 18-21.

5. Discussion

Appendix

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
3 87 82 9828 | 0.05 | 0 106 101 9793
0 92 105 9803 | 0.5 | 0 90 94 9816
52 110 111 9727 | 1 | 1 111 111 9777
51 110 93 9746 | 1.25 | 2475 338 358 6829
1136 160 198 8506 | 1.5 | 3119 526 483 5872
1776 245 381 7598 | 1.75 | 4252 653 666 4429
3526 547 413 5514 | 2 | 789 883 869 7549
848 766 779 7607 | 2.5 | 496 2276 2368 4860
544 2313 2306 4837 | 2.95 | 539 2821 2112 4528
Table 1: Prisoner's dilemma Pattern 1

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9422 218 183 177 | 0.05 | 9334 302 285 79
9036 399 388 150 | 0.5 | 9346 294 220 140
5691 738 678 2693 | 1 | 7537 954 1267 242
3506 179 275 6040 | 1.25 | 8203 542 994 261
1181 184 116 8519 | 1.5 | 7818 767 775 640
2 98 103 9797 | 1.75 | 4685 270 422 4623
97 114 91 9698 | 2 | 1820 217 220 7743
0 100 92 9808 | 2.5 | 0 77 117 9806
2 96 94 9808 | 2.95 | 0 90 114 9796
Table 2: Prisoner's dilemma Pattern 2

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9790 101 101 8 | 0 | 9808 94 98 0
4147 137 156 5560 | 0.1 | 9812 94 93 1
3019 123 165 6693 | 0.15 | 9799 95 104 2
2188 141 132 7539 | 0.2 | 8934 85 109 872
185 355 130 9330 | 0.5 | 730 284 208 8778
131 309 135 9425 | 1 | 120 532 138 9210
138 288 99 9475 | 1.5 | 77 471 103 9349
99 321 131 9449 | 2 | 88 441 126 9345
126 172 88 9614 | 3 | 64 366 92 9478
Table 3: Game #47

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9789 102 107 2 | 0 | 9787 106 105 2
3173 515 173 6139 | 0.1 | 9811 86 101 2
2832 457 207 6504 | 0.15 | 8127 256 137 1480
1227 348 141 8284 | 0.2 | 2986 755 230 6029
109 627 143 9121 | 0.5 | 143 631 146 9080
90 492 139 9279 | 1 | 79 1320 126 8475
88 318 134 9460 | 1.5 | 117 1076 128 8679
241 236 119 9404 | 2 | 62 473 126 9339
76 284 139 9501 | 3 | 64 277 128 9531
Table 4: Game #48

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9767 119 107 7 | 0 | 9764 131 105 0
1684 587 175 7554 | 0.1 | 9794 106 98 2
531 518 191 8760 | 0.15 | 9550 105 105 240
238 543 159 9060 | 0.2 | 1048 497 257 8198
126 307 121 9446 | 0.5 | 224 852 152 8772
118 520 114 9248 | 1 | 113 753 119 9015
104 526 125 9245 | 1.5 | 74 538 117 9271
66 225 102 9607 | 2 | 57 569 123 9251
123 296 116 9465 | 3 | 61 302 125 9512
Table 5: Game #57

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9390 126 122 362 | 0 | 9715 108 109 68
9546 91 108 255 | 0.5 | 9681 120 121 78
9211 112 125 552 | 0.75 | 9669 111 101 119
8864 119 110 907 | 1 | 9666 98 102 134
8634 115 132 1119 | 1.25 | 9598 139 134 129
7914 122 130 1834 | 1.5 | 9465 99 109 327
7822 122 104 1952 | 2 | 9452 126 126 296
5936 87 101 3876 | 2.5 | 8592 116 89 1203
5266 121 106 4507 | 3 | 3524 111 115 6250
Table 6: Stag Hunt

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
2641 63 4571 2725 | 0 | 2872 73 4477 2578
3842 135 1626 4397 | 0.1 | 4615 101 1732 3552
5140 102 90 4668 | 0.5 | 4772 102 162 4964
4828 107 94 4971 | 1 | 4862 88 89 4961
4122 101 109 5668 | 1.5 | 4642 85 102 5171
4983 100 97 4820 | 2 | 4623 97 87 5193
3814 111 96 5979 | 2.5 | 5139 102 99 4660
4015 1388 107 4490 | 2.9 | 4303 1794 118 3785
2653 4921 70 2356 | 3 | 2593 4776 58 2573
Table 7: Battle of the Sexes

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9276 227 347 150 | 0 | 9509 165 222 104
9587 143 135 135 | 0.25 | 9119 428 320 133
9346 209 223 222 | 0.5 | 9375 220 225 180
6485 1491 1858 166 | 0.75 | 8759 424 632 185
1663 3532 4706 99 | 1 | 1339 4903 3662 96
385 4161 5342 112 | 1.25 | 158 5416 4323 103
113 4488 5274 125 | 1.5 | 115 4700 5099 86
111 4301 5504 84 | 1.75 | 100 4704 5083 113
100 4853 4953 94 | 2 | 94 4772 5044 90
Table 8: Chicken

* The memory length setting is 1 for Tables 1-8.

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
253 122 129 9496 | 0.05 | 2 85 106 9807
860 137 133 8870 | 0.5 | 0 98 103 9799
300 124 112 9464 | 1 | 4 95 100 9801
227 88 121 9564 | 1.25 | 15 160 154 9671
1615 316 325 7744 | 1.5 | 309 304 365 9022
2900 1112 1085 4903 | 1.75 | 590 682 746 7982
2748 1681 1652 3919 | 2 | 281 1522 1476 6721
1919 2927 2988 2166 | 2.5 | 389 4235 4111 1265
905 4384 4199 512 | 2.95 | 578 4170 4018 1234
Table 9: Prisoner's dilemma Pattern 1

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9453 267 258 22 | 0.05 | 9441 277 206
9230 312 314 144 | 0.5 | 9379 260 296
7591 674 633 1102 | 1 | 8940 251 245
4360 483 542 4615 | 1.25 | 8100 444 376
1297 456 443 7804 | 1.5 | 4311 867 813
2 112 100 9786 | 1.75 | 816 416 406
3 81 99 9817 | 2 | 3 129 117
4 87 101 9808 | 2.5 | 0 100 95
0 108 91 9801 | 2.95 | 1 93 122
Table 10: Prisoner's dilemma Pattern 2

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9772 110 115 3 | 0 | 9784 93 123 0
8812 275 163 750 | 0.1 | 9810 102 88 0
5200 561 227 4012 | 0.15 | 9812 97 88 3
3704 817 210 5269 | 0.2 | 9702 111 89 98
290 558 185 8967 | 0.5 | 1458 656 224 7662
105 680 159 9056 | 1 | 99 646 216 9039
99 404 154 9343 | 1.5 | 55 270 182 9493
71 463 124 9342 | 2 | 16 224 147 9613
77 271 135 9517 | 3 | 34 245 184 9537
Table 11: Game #47

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9766 126 108 0 | 0 | 9803 100 95 2
7772 179 153 1896 | 0.1 | 9821 90 87 2
3775 802 193 5230 | 0.15 | 9598 104 104 194
2848 1011 203 5938 | 0.2 | 7877 308 204 1611
491 539 202 8768 | 0.5 | 1083 947 199 7771
113 758 170 8959 | 1 | 93 791 158 8958
114 677 187 9022 | 1.5 | 58 567 150 9225
66 477 131 9326 | 2 | 33 286 161 9520
66 411 168 9355 | 3 | 43 236 152 9569
Table 12: Game #48

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
8757 715 111 417 | 0 | 9765 146 85 4
3084 2534 124 4258 | 0.1 | 9634 211 88 67
1553 2235 241 5971 | 0.15 | 4634 2717 176 2476
491 1408 192 7909 | 0.2 | 2377 2280 207 5136
131 623 170 9076 | 0.5 | 152 1331 196 8321
106 1005 201 8688 | 1 | 130 791 127 8952
51 582 131 9236 | 1.5 | 43 396 135 9426
60 490 151 9299 | 2 | 47 471 161 9321
83 570 192 9155 | 3 | 62 392 189 9357
Table 13: Game #57
ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9557 122 109 212 | 0 | 9742 138 109 11
9052 161 134 653 | 0.5 | 9742 123 128 7
8703 151 135 1011 | 0.75 | 9613 177 185 25
8493 190 148 1169 | 1 | 9670 120 174 36
8197 193 147 1463 | 1.25 | 9458 140 110 292
7745 133 122 2000 | 1.5 | 9396 166 127 311
5900 148 129 3823 | 2 | 8119 158 122 1601
4111 114 142 5633 | 2.5 | 4015 156 163 5666
2772 102 126 7000 | 3 | 1462 107 115 8316
Table 14: Stag Hunt

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
4124 125 1728 4023 | 0 | 3929 93 2132 3846
4835 141 746 4278 | 0.1 | 4556 97 580 4767
4092 94 117 5697 | 0.5 | 6178 92 97 3633
5788 89 95 4028 | 1 | 4592 79 99 5230
5308 94 108 4490 | 1.5 | 5012 107 110 4771
4620 138 105 5137 | 2 | 4619 121 94 5166
5119 146 93 4642 | 2.5 | 4357 121 93 5429
4797 679 144 4308 | 2.9 | 4250 1286 143 4321
3797 2112 125 3966 | 3 | 3727 2157 106 4010
Table 15: Battle of the Sexes

ε-greedy action selection (CC CD DC DD) | δ | Softmax action selection (CC CD DC DD)
9342 297 280 81 | 0 | 9538 206 215
9205 316 405 74 | 0.25 | 9465 256 236
8529 727 684 60 | 0.5 | 9225 372 332
7300 1178 1413 109 | 0.75 | 8012 871 967
1401 3819 4666 114 | 1 | 1919 4323 3659
223 4728 4937 112 | 1.25 | 255 4957 4661
162 4073 5667 98 | 1.5 | 171 4825 4898
188 5105 4574 133 | 1.75 | 130 4986 4787
149 4693 5064 94 | 2 | 107 3816 5971
Table 16: Chicken

* The memory length setting is 2 for Tables 9-16.

δ | CC CD DC DD
0 | 9594 106 298 2
0.1 | 7295 162 341 2202
0.15 | 3193 318 340 6149
0.2 | 1934 530 325 7211
0.5 | 314 365 177 9144
1 | 254 550 149 9047
1.5 | 253 529 184 9034
2 | 206 379 151 9264
3 | 246 471 177 9106
Table 17: Game #47 with ε-greedy action selection and memory length 1

The outcomes | Frequency in 100 trials
Alternating between CC and DD | 8
Converge to CC or DD | 92
Table 18: Battle of the Sexes with ε-greedy action selection and memory length 1

The outcomes | Frequency in 100 trials
Alternating between CD and DC | 9
Converge to one of the three: CC, CD or DC | 70
Other patterns | 11
No obvious patterns | 10
Table 19: Chicken with ε-greedy action selection and memory length 1

The outcomes | Frequency in 100 trials
Alternating between CC and DD | 29
Converge to CC or DD | 71
Table 20: Battle of the Sexes with ε-greedy action selection and memory length 2

The outcomes | Frequency in 100 trials
Alternating between CD and DC | 26
Converge to one of the three: CC, CD or DC | 46
Other patterns | 18
No obvious patterns | 10
Table 21: Chicken with ε-greedy action selection and memory length 2

References

1. Rapoport, Anatol; Guyer, Melvin J.; and Gordon, David G. (1976). The 2x2 Game. Ann Arbor, MI: University of Michigan Press.
2. Roth, Alvin E. and Erev, Ido (1995). "Learning in Extensive-Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term." Games and Economic Behavior 8, 164-212.
3. Sandholm, Thomas W. and Crites, Robert H. (1995). "Multiagent Reinforcement Learning in the Iterated Prisoner's Dilemma." Biosystems 37, 147-166.
4. Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
5. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." Machine Learning 8, 279-292.
6. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.