Choosing between several options in uncertain environments

  1. 1. METAGAMING: Bandits with simple regret and small budget Chen-Wei Chou, Ping-Chiang Chou, Chang-Shing Lee, David Lupien St-Pierre, Olivier Teytaud, Mei-Hui Wang, Li-Wen Wu and Shi-Jim Yen
  2. 2. Outline: - what is a bandit problem? - what is a strategic bandit problem? - is a strategic bandit different from a bandit? - algorithms - results
  3. 3. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step: - you choose one option - you get a reward, distributed according to its probability distribution. At the end: - you choose one option (you cannot change it anymore...) - your reward is the expected reward associated with this option.
  4. 4. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step: - you choose one option - you get a reward, distributed according to its probability distribution. At the end: - you choose one option (you cannot change it anymore...) - your reward is the expected reward associated with this option. Here we collect information.
  5. 5. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step: - you choose one option - you get a reward, distributed according to its probability distribution. At the end: - you choose one option (you cannot change it anymore...) - your reward is the expected reward associated with this option. Here we use the information for the final choice.
  6. 6. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step: - you choose one option - you get a reward, distributed according to its probability distribution. At the end: - you choose one option (you cannot change it anymore...) - your reward is the expected reward associated with this option. Here, we explore.
  7. 7. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step: - you choose one option - you get a reward, distributed according to its probability distribution. At the end: - you choose one option (you cannot change it anymore...) - your reward is the expected reward associated with this option. Here, we take no risk.
  8. 8. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step (exploration): - you choose one option - you get a reward, distributed according to its probability distribution. At the end (recommendation): - you choose one option (you cannot change it anymore...) - your reward is the expected reward associated with this option.
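As a minimal sketch of this exploration/recommendation protocol (the option rewards, the budget and the uniform exploration rule below are illustrative assumptions, not the paper's setting):

```python
import random

# Hypothetical Bernoulli options: the (unknown) success probability of each option.
TRUE_PROBS = [0.4, 0.55, 0.6]   # made-up values, for illustration only
BUDGET = 100                    # number of exploration steps

counts = [0] * len(TRUE_PROBS)  # how often each option was tried
sums = [0.0] * len(TRUE_PROBS)  # cumulated reward per option

# Exploration: here, plain uniform sampling (one of the baselines discussed later).
for t in range(BUDGET):
    i = t % len(TRUE_PROBS)
    reward = 1.0 if random.random() < TRUE_PROBS[i] else 0.0
    counts[i] += 1
    sums[i] += reward

# Recommendation: pick one option, once and for all (here the empirically best one).
means = [s / c for s, c in zip(sums, counts)]
recommended = max(range(len(means)), key=lambda i: means[i])

# Simple regret = expected reward of the best option minus that of the recommended option.
print(recommended, max(TRUE_PROBS) - TRUE_PROBS[recommended])
```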
  9. 9. Which kind of bandit? - in the bandit literature, options are also termed “arms” - here the criterion is the expected reward of the option chosen at the end (sometimes it is instead the sum of the rewards during exploration) - so far we presented stochastic bandits (one probability distribution per option) ==> next slide is different
  10. 10. And adversarial bandits? A finite number of time steps. A (finite) number of options for player 1, and a finite number of options for player 2. An unknown probability distribution for each pair of options. At each time step: - you choose one option for P1 and one option for P2 - you get a reward, distributed according to the corresponding probability distribution. At the end: - you choose one probabilistic option for P1 (you cannot change it anymore...) - your reward is the expected reward associated with this option, for the worst choice by P2.
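A rough sketch of the two-player criterion on the recommendation side: the recommendation is a probability distribution over P1's options, scored against the worst choice by P2 (the payoff matrix and the mixed strategies below are invented for illustration):

```python
# Hypothetical expected-reward matrix for P1: rows = P1 options, columns = P2 options.
PAYOFF = [[0.7, 0.3],
          [0.2, 0.8]]

def worst_case_value(mixed_strategy):
    # Value of a probabilistic choice for P1 against the worst choice by P2.
    return min(sum(p * PAYOFF[i][j] for i, p in enumerate(mixed_strategy))
               for j in range(len(PAYOFF[0])))

# A deterministic choice can be weak against the worst case...
print(worst_case_value([1.0, 0.0]))   # 0.3
print(worst_case_value([0.0, 1.0]))   # 0.2
# ...while a probabilistic recommendation hedges between the two options.
print(worst_case_value([0.6, 0.4]))   # 0.5
```

This is why the recommendation in the adversarial setting is a probabilistic option rather than a single deterministic one.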
  11. 11. What is meta-gaming? What is a “strategic choice”? Strategic choices: - decisions made once and for all, at a high level - different from the tactical level. Meta-gaming: a choice at the strategic level, in games: - choosing cards, in card games - choosing handicap positioning, in Go ==> once and for all, at the beginning of the game.
  12. 12. Example of stochastic bandit (i.e. 1P strategic choice): the Game of Go handicap bandit problem. At each time step: - you choose one handicap positioning - then you simulate one game from this position ==> only one player has a strategic choice ==> stochastic bandit
  13. 13. Example of adversarial bandit (i.e. 2P strategic choice): the Urban Rivals bandit problem. At each time step: - you choose one set of cards for you (P1) and one set of cards for P2 - then you simulate one Urban Rivals game from this position ==> two players have a strategic choice ==> adversarial bandit
  14. 14. Is a strategic bandit problem different from a classical bandit problem? No difference in nature, just a much smaller budget.
  15. 15. Algorithms. Reminder: two algorithms are needed: - one for choosing during the N exploration steps - one for choosing during the single recommendation step. Two settings: - one-player case - two-player case.
  16. 16. Algorithms for exploration. Uniform: test all options uniformly. Bernstein races: sample uniformly among non-discarded options, and discard options with statistical tests. Successive Reject: sample uniformly among non-discarded options, and periodically discard the worst option. UCB: choose the option with the best average result + a bonus for weakly sampled options. Adaptive-UCB-E: a variant of UCB aimed at removing hyper-parameters. EXP3: empirically best option + a random perturbation.
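A minimal sketch of two of these exploration rules, a UCB-style index policy and EXP3, on assumed Bernoulli options (the reward probabilities and the constants c and gamma are illustrative, not the tuning used in the paper):

```python
import math
import random

TRUE_PROBS = [0.4, 0.55, 0.6]    # hypothetical options, for illustration only

def pull(i):
    # One simulation of option i, returning a Bernoulli reward.
    return 1.0 if random.random() < TRUE_PROBS[i] else 0.0

def ucb_exploration(budget, c=math.sqrt(2)):
    """UCB: best empirical average plus a bonus for weakly sampled options."""
    k = len(TRUE_PROBS)
    counts, sums = [0] * k, [0.0] * k
    for t in range(budget):
        if t < k:                        # try each option once to initialize
            i = t
        else:
            i = max(range(k), key=lambda a: sums[a] / counts[a]
                    + c * math.sqrt(math.log(t + 1) / counts[a]))
        counts[i] += 1
        sums[i] += pull(i)
    return counts, sums

def exp3_exploration(budget, gamma=0.1):
    """EXP3: sample the empirically best options with a random perturbation."""
    k = len(TRUE_PROBS)
    weights = [1.0] * k
    counts = [0] * k
    for _ in range(budget):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        i = random.choices(range(k), weights=probs)[0]
        r = pull(i)
        counts[i] += 1
        # Importance-weighted exponential update of the chosen option.
        weights[i] *= math.exp(gamma * r / (probs[i] * k))
    return counts, weights
```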
  17. 17. Algorithms for recommendation. Empirically Best Arm (EBA): choose the empirically best option. Most Played Arm (MPA): choose the most simulated option. Successive Reject: the only non-discarded option. UCB: choose the option with the best average result + a bonus for weakly sampled options. LCB: choose the option with the best average result minus a malus for weakly sampled options. Empirical distribution of play: an option is recommended with its frequency of play (during exploration) as probability. TEXP3: same, but low-probability options are discarded.
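And a sketch of some of the recommendation rules, reusing exploration statistics such as the counts/sums produced above (the LCB constant and the TEXP3 truncation threshold are assumptions, not the paper's values; each option is assumed to have been tried at least once):

```python
import math

def empirically_best_arm(counts, sums):
    # EBA: the option with the best empirical average.
    return max(range(len(counts)), key=lambda i: sums[i] / counts[i])

def most_played_arm(counts):
    # MPA: the option that was simulated most often.
    return max(range(len(counts)), key=lambda i: counts[i])

def lcb_arm(counts, sums, budget, c=math.sqrt(2)):
    # LCB: empirical average minus a malus for weakly sampled options.
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               - c * math.sqrt(math.log(budget) / counts[i]))

def texp3_distribution(frequencies, threshold=0.05):
    # Empirical distribution of play, with low-probability options discarded
    # (TEXP3-style truncation); assumes at least one frequency exceeds the threshold.
    kept = [f if f >= threshold else 0.0 for f in frequencies]
    total = sum(kept)
    return [f / total for f in kept]

# Toy usage with made-up exploration statistics (50+80+70 = 200 simulations):
counts, sums = [50, 80, 70], [22.0, 44.0, 40.0]
print(empirically_best_arm(counts, sums), most_played_arm(counts),
      lcb_arm(counts, sums, budget=200))
print(texp3_distribution([0.03, 0.57, 0.40]))
```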
  18. 18. Experimental results. The big, boring tables of results are in the paper; only a sample of the clearest results is shown here.
  19. 19. One-player case: Killall-Go stone positioning.
  20. 20. One-player case: Killall-Go stone positioning. Uncertainty should get a malus in the recommendation step.
  21. 21. One-player case: Killall-Go stone positioning. EXP3 is for the two-player case.
  22. 22. Experimental results: TEXP3 outperforms EXP3 by far. Two-player case; game = Urban Rivals (a free online card game).
  23. 23. Do you know killall-Go? Black has stones in advance (e.g. 8 in 13x13). If White makes life, White wins. If Black kills everything, Black wins. Black chooses the stone positioning (strategic decisions).
  24. 24. Left: the human is Black and chooses E3 C4. Right: the computer is Black and chooses D3 D5. White won both games. The human said that the computer's choice D3 D5 is good.
  25. 25. Killall-Go, H8 (left) and H9 (right). Left: a human professional player (5P) as Black has 8 handicap stones; White (the computer) makes life and wins. Right: the human professional (5P) as Black has 9 handicap stones, kills everything and wins.
  26. 26. CONCLUSIONS. One-player case: UCB for exploration, LCB or MPA for recommendation. Two-player case: TEXP3 performs best. Killall-Go: win against a pro with H2 in 7x7 killall-Go as White; loss against a pro with H2 in 7x7 killall-Go as Black. 13x13: the computer won as White with H8 and lost with H9; it lost as Black with both H8 and H9. Further work: structured bandits (some options are close to each other); Batoo (Go with a strategic choice for both players, a nice test case); industry (choosing investments for power grid simulations, in progress).
