Choosing between several options in uncertain environments

Transcript

  • 1. METAGAMING: Bandits with simple regret and small budget Chen-Wei Chou, Ping-Chiang Chou, Chang-Shing Lee, David Lupien St-Pierre, Olivier Teytaud, Mei-Hui Wang, Li-Wen Wu and Shi-Jim Yen
  • 2. Outline: - What is a bandit problem? - What is a strategic bandit problem? - Is a strategic bandit different from a bandit? - Algorithms - Results
  • 3. What is a bandit problem? A finite number of time steps. A (finite) number of options, each equipped with an (unknown) probability distribution. At each time step: you choose one option and you get a reward drawn from that option's probability distribution. At the end: you choose one option (you cannot change it anymore); your reward is the expected reward associated with that option.
  • 4. (Same slide, with a callout on the time steps: here we collect information.)
  • 5. (Same slide, with a callout on the final choice: here we use the information collected.)
  • 6. (Same slide, with a callout on the time steps: here, we explore.)
  • 7. (Same slide, with a callout: here, we take no risk.)
  • 8. Same definition, with the two phases named: the time steps form the exploration phase; the final choice is the recommendation phase.
  • 9. Which kind of bandit? In the bandit literature, options are also termed "arms". Here the criterion is the expected reward of the option chosen at the end, i.e. the simple-regret setting (sometimes the criterion is instead the sum of the rewards collected during exploration). So far this is the stochastic bandit setting (one probability distribution per option); the next slide is different. (A minimal code sketch of this explore-then-recommend loop is given after the transcript.)
  • 10. And adversarial bandits? A finite number of time steps. A (finite) number of options for player 1 and a (finite) number of options for player 2, with an unknown probability distribution for each pair of options. At each time step: you choose one option for P1 and one option for P2, and you get a reward drawn from the corresponding probability distribution. At the end: you choose one probabilistic (mixed) option for P1 (you cannot change it anymore); your reward is the expected reward of this option against the worst choice by P2.
  • 11. What is meta-gaming? What is a "strategic choice"? Strategic choices are decisions made once and for all, at a high level, as opposed to the tactical level. Meta-gaming is a choice at the strategic level in games: choosing cards in card games, or choosing handicap placement in Go, once and for all, at the beginning of the game.
  • 12. Example of a stochastic bandit (i.e. a one-player strategic choice): the Game of Go handicap bandit problem. At each time step, you choose one handicap placement, then you simulate one game from that position. Only one player has a strategic choice, so this is a stochastic bandit.
  • 13. Example of an adversarial bandit (i.e. a two-player strategic choice): the Urban Rivals bandit problem. At each time step, you choose one set of cards for you (P1) and one set of cards for P2, then you simulate one Urban Rivals game with these sets. Both players have a strategic choice, so this is an adversarial bandit.
  • 14. Is a strategic bandit problem different from a classical bandit problem? No difference in nature, just a much smaller budget.
  • 15. Algorithms. Reminder: two algorithms are needed, one for choosing during the N exploration steps and one for the single recommendation step; and there are two settings, the one-player case and the two-player case.
  • 16. Algorithms for exploration:
    - Uniform: test all options uniformly.
    - Bernstein races: sample uniformly among the non-discarded options; discard options using statistical tests.
    - Successive Reject: sample uniformly among the non-discarded options; periodically discard the worst option.
    - UCB: choose the option with the best average result plus a bonus for weakly sampled options.
    - Adaptive-UCB-E: a variant of UCB aimed at removing hyper-parameters.
    - EXP3: the empirically best option plus a random perturbation.
  • 17. Algorithms for recommendation:
    - Empirically Best Arm: choose the empirically best option.
    - Most Played Arm: choose the most simulated option.
    - Successive Reject: the only non-discarded option.
    - UCB: choose the option with the best average result plus a bonus for weakly sampled options.
    - LCB: choose the option with the best average result minus a penalty for weakly sampled options.
    - Empirical distribution of play: an option is recommended with the frequency at which it was played during exploration.
    - TEXP3: likewise, but low-probability options are discarded.
    (Code sketches of a UCB/LCB pair and of an EXP3/TEXP3 pair are given after the transcript.)
  • 18. Experimental results. The big, boring tables of results are in the paper; only a sample of the clearest results is shown here.
  • 19. One-player case: Killall-Go stone positioning (results figure).
  • 20. One-player case: Killall-Go stone positioning (results figure). Takeaway: uncertainty should be penalized in the recommendation step.
  • 21. One-player case: Killall-Go stone positioning (results figure), with EXP3 shown for the two-player case.
  • 22. Experimental results: TEXP3 outperforms EXP3 by far in the two-player case, on Urban Rivals (a free online card game). (See the EXP3/TEXP3 sketch after the transcript.)
  • 23. Do you know Killall-Go? Black has stones in advance (e.g. 8 on 13x13). If White makes life, White wins; if Black kills everything, Black wins. Black chooses the stone placement (the strategic decision).
  • 24. Left: the human is Black and chooses E3, C4. Right: the computer is Black and chooses D3, D5. White won both games. The human said that the computer's choice D3, D5 is good.
  • 25. Killall-Go, H8 (left) and H9 (right). Left: a human professional player (5P) as Black has 8 handicap stones; White (the computer) makes life and wins. Right: the human professional (5P) as Black has 9 handicap stones, kills everything, and wins.
  • 26. CONCLUSIONS.
    - One-player case: UCB for exploration, LCB or MPA for recommendation.
    - Two-player case: TEXP3 performs best.
    - Killall-Go: win against a pro with H2 in 7x7 Killall-Go as White; loss against a pro with H2 in 7x7 Killall-Go as Black.
    - 13x13: the computer won as White with H8 and lost with H9; it lost as Black with both H8 and H9.
    - Further work: structured bandits (some options are close to each other); Batoo (Go with a strategic choice for both players, a nice test case); industry (choosing investments for power grid simulations, in progress).
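
The explore-then-recommend setting of slides 3-9 can be made concrete with a short sketch. This is a minimal illustration, not code from the paper: the Bernoulli reward model, the function names, and the uniform/Empirically-Best-Arm pair are assumptions chosen for simplicity.

```python
import random

def uniform_explore_then_recommend(arm_means, budget, rng=random.Random(0)):
    """Simple-regret bandit sketch: uniform exploration, then recommend the
    empirically best arm (EBA). arm_means are the unknown Bernoulli means of
    the options; they are only used to draw rewards, never read by the player."""
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):                     # exploration phase
        a = t % n_arms                          # uniform allocation over options
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
    means = [sums[a] / counts[a] for a in range(n_arms)]
    recommended = max(range(n_arms), key=lambda a: means[a])   # recommendation (EBA)
    simple_regret = max(arm_means) - arm_means[recommended]
    return recommended, simple_regret

# Example: 4 hypothetical handicap placements with unknown win rates, budget of 40 games.
print(uniform_explore_then_recommend([0.45, 0.50, 0.62, 0.58], budget=40))
```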
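
The one-player recipe from the conclusion (UCB for exploration, LCB or MPA for recommendation, slides 16-17) might look roughly like the following. The sqrt(2 log t / n) confidence term, the Bernoulli simulator, and all names are assumptions made for illustration; the paper's exact constants may differ.

```python
import math
import random

def ucb_explore_lcb_recommend(draw_reward, n_arms, budget):
    """One-player strategic bandit sketch: UCB during exploration, LCB at
    recommendation time. draw_reward(a) simulates one game with option a and
    returns a reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def mean(a):
        return sums[a] / counts[a]

    for t in range(budget):
        if t < n_arms:
            a = t                                # play every option once first
        else:
            # UCB: empirical mean plus a bonus for weakly sampled options
            a = max(range(n_arms),
                    key=lambda i: mean(i) + math.sqrt(2.0 * math.log(t + 1) / counts[i]))
        counts[a] += 1
        sums[a] += draw_reward(a)

    # LCB: empirical mean minus a penalty for weakly sampled options
    return max(range(n_arms),
               key=lambda i: mean(i) - math.sqrt(2.0 * math.log(budget) / counts[i]))

# Example with hypothetical win rates per handicap placement.
rng = random.Random(1)
win_rates = [0.40, 0.55, 0.50, 0.47]
best = ucb_explore_lcb_recommend(lambda a: float(rng.random() < win_rates[a]),
                                 n_arms=len(win_rates), budget=60)
print("recommended option:", best)
```

Most Played Arm, the other recommendation rule mentioned in the conclusion, would instead simply return the option with the largest count.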
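
For the two-player case (slides 10, 13, 17, 22), both players can run EXP3 during exploration and then recommend a truncated empirical distribution of play (TEXP3). This is a hedged sketch: the gamma parameter, the sqrt(budget) truncation threshold, and the zero-sum reward convention are assumptions, not the paper's exact rule.

```python
import math
import random

def exp3_then_texp3(game, n1, n2, budget, gamma=0.1, rng=random.Random(2)):
    """Two-player strategic bandit sketch: both players run EXP3 during
    exploration; the recommended mixed strategies are the empirical
    frequencies of play, truncated TEXP3-style (rare options dropped, then
    renormalized). game(i, j) simulates one match and returns P1's reward in [0, 1]."""
    w1, w2 = [1.0] * n1, [1.0] * n2
    plays1, plays2 = [0] * n1, [0] * n2

    def probs(w, n):
        s = sum(w)
        return [(1.0 - gamma) * wi / s + gamma / n for wi in w]

    for _ in range(budget):
        p1, p2 = probs(w1, n1), probs(w2, n2)
        i = rng.choices(range(n1), weights=p1)[0]
        j = rng.choices(range(n2), weights=p2)[0]
        r = game(i, j)                       # reward for player 1; player 2 gets 1 - r
        w1[i] *= math.exp(gamma * (r / p1[i]) / n1)
        w2[j] *= math.exp(gamma * ((1.0 - r) / p2[j]) / n2)
        plays1[i] += 1
        plays2[j] += 1

    def truncate(plays):
        # TEXP3-style recommendation: drop rarely played options and keep the
        # rest in proportion to their empirical frequency. The sqrt(budget)
        # threshold is an assumed stand-in for the paper's truncation rule.
        threshold = math.sqrt(budget)
        kept = [c if c >= threshold else 0 for c in plays]
        total = sum(kept) or 1
        return [c / total for c in kept]

    return truncate(plays1), truncate(plays2)

# Example on a hypothetical 3x3 card-choice game with known win probabilities.
rng = random.Random(3)
P = [[0.5, 0.7, 0.2], [0.3, 0.5, 0.8], [0.8, 0.2, 0.5]]
strategy1, strategy2 = exp3_then_texp3(lambda i, j: float(rng.random() < P[i][j]),
                                       n1=3, n2=3, budget=300)
print("P1 mixed strategy:", strategy1)
print("P2 mixed strategy:", strategy2)
```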