Simple regret bandit algorithms for unstructured noisy optimization

Bandit algorithms can be effective for optimization when the domain is unstructured; simple regret is then the relevant regret criterion.



  1. Parameter tuning by bandit algorithms. Madeira, June 2010.
  2. Outline: Parameter tuning; Bandits; Our testbed; Results.
  3. Parameter tuning is optimization. Optimization: I have a function f and I want its global minimizer x*, i.e. f(x*) ≤ f(x) for all x.
  4. Expensive optimization: same setting, but f is an expensive function. It takes hours of computation, or hours of cluster-based computation, or hours of human work; or maybe it is not even a function.
  5. Parameter tuning is great: f is the testing of a program under given coefficients; crucial in many applications ==> expensive optimization.
  6. Outline: Parameter tuning; Bandits; Our testbed; Results.
  7. A “bandit” problem: p1, ..., pN are unknown probabilities in [0,1]. At each time step i in [1,n], choose ui in {1, ..., N} (as a function of the uj and rj, j < i); with probability p_ui you win (ri = 1), otherwise you lose (ri = 0).
  8. A “bandit” problem, the target. Regret: Rn = n·max(p1, ..., pN) − Σ_{j=1..n} rj. How do we minimize the regret (worst case over p)?
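(Aside: the protocol above is easy to simulate. The sketch below is only an illustration: the toy probabilities, the random policy and the helper names are mine, not from the slides. It pulls a Bernoulli bandit n times with an arbitrary policy and returns the cumulative regret Rn = n·max(pi) − Σ rj.)

```python
import random

def play_bandit(p, policy, n, seed=0):
    """Simulate n pulls of a Bernoulli bandit with success probabilities p.

    policy(history) returns the index of the arm to pull, given the list of
    (arm, reward) pairs observed so far.  Returns the cumulative regret
    R_n = n * max(p) - sum of the collected rewards, as on the slide.
    """
    rng = random.Random(seed)
    history = []
    total_reward = 0.0
    for _ in range(n):
        arm = policy(history)
        reward = 1.0 if rng.random() < p[arm] else 0.0
        history.append((arm, reward))
        total_reward += reward
    return n * max(p) - total_reward

# Example: a policy that ignores the history and picks arms at random.
p = [0.2, 0.5, 0.8]
print(play_bandit(p, lambda history: random.randrange(len(p)), n=1000))
```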
  9. Bandits, a classical solution. UCB1: choose the decision u maximizing the compromise: empirical average of the rewards of decision u + √( log(i) / number of trials of decision u ) ==> optimal regret O(log n) (Lai et al.; Auer et al.).
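(A minimal sketch of the UCB1 choice rule as written on the slide; the slide's formula has no constant in front of the square root, while many implementations multiply the exploration term by a tunable constant. Helper names are mine.)

```python
import math

def ucb1_choice(counts, sums, t):
    """UCB1 as on the slide: pick the arm maximizing
    empirical mean + sqrt(log(t) / number of trials of that arm).

    counts[u]: number of times arm u was tried, sums[u]: sum of its rewards,
    t: current time step (1-based).  Untried arms are played first.
    """
    for u, c in enumerate(counts):
        if c == 0:
            return u
    def score(u):
        return sums[u] / counts[u] + math.sqrt(math.log(t) / counts[u])
    return max(range(len(counts)), key=score)
```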
  10. Infinite bandit: progressive widening. Same UCB1 score, but the argmax is taken only over the first ~i^α arms, with α in [0.25, 0.5] (Coulom; Chaslot et al.; Wang et al.).
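(My reading of the garbled “[ 0.25 0.5 ]” on this slide is the usual progressive-widening exponent range: the argmax is restricted to roughly the first i^α arms with α ∈ [0.25, 0.5]. The sketch below encodes that assumption; the default α = 0.4 is an arbitrary illustrative value.)

```python
import math

def progressive_widening_choice(counts, sums, t, alpha=0.4):
    """UCB1 restricted to the first ceil(t ** alpha) arms (progressive widening).

    alpha is assumed to lie in the [0.25, 0.5] range quoted on the slide.
    Arms are assumed to be ordered so that promising candidates come first.
    """
    k = min(len(counts), max(1, math.ceil(t ** alpha)))
    for u in range(k):
        if counts[u] == 0:
            return u
    def score(u):
        return sums[u] / counts[u] + math.sqrt(math.log(t) / counts[u])
    return max(range(k), key=score)
```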
  11. Bandits: much more. What is a bandit? A criterion (here the regret) defines the problem; usually a score (typically exploration + exploitation) defines the bandit algorithm ==> a score optimal for one criterion is not optimal for another ==> a wide literature ==> centered on finite-time analysis.
  12. Simple regret. What is “simple regret”? The regret above is the cumulated regret. Simple regret = optimal expected value − expected value of the recommended arm ==> optimization “on average” ==> no structure on the search space! ==> what are good algorithms for this criterion?
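(In code, the change of criterion only affects what is measured at the end: a recommendation is made once the budget is spent, and only its expected value counts. A minimal sketch, with helper names of my choosing:)

```python
def recommend(counts, sums):
    """After the exploration budget is spent, recommend the empirically best arm."""
    means = [sums[u] / counts[u] if counts[u] > 0 else float("-inf")
             for u in range(len(counts))]
    return max(range(len(counts)), key=means.__getitem__)

def simple_regret(p, recommended_arm):
    """Simple regret: optimal expected value minus the expected value of the
    recommended arm.  Rewards collected during exploration do not count."""
    return max(p) - p[recommended_arm]
```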
  15. Bubeck et al.: uniform is optimal! For a fixed regret, the required budget n is linear as a function of the number of arms K.
  16. Bubeck et al.: uniform is optimal! For a fixed regret, n is linear as a function of K; the regret is exponential (decreases exponentially) as a function of n.
  18. Bubeck et al.: uniform is optimal! Yet, non-asymptotically, UCB is much better (see also the “successive rejects” algorithm).
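(A small self-contained simulation to get a feeling for this comparison; the arm gap, budget and number of repetitions below are arbitrary illustrative choices, and which allocation wins depends on them, which is exactly the asymptotic vs. non-asymptotic point of the slide.)

```python
import math, random

def final_simple_regret(policy, p, n, rng):
    """Spend a budget of n pulls with `policy`, recommend the empirically best
    arm, and return its simple regret."""
    K = len(p)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, n + 1):
        u = policy(counts, sums, t)
        r = 1.0 if rng.random() < p[u] else 0.0
        counts[u] += 1
        sums[u] += r
    means = [sums[u] / counts[u] if counts[u] else -1.0 for u in range(K)]
    return max(p) - p[max(range(K), key=means.__getitem__)]

def uniform(counts, sums, t):
    return (t - 1) % len(counts)          # round-robin allocation

def ucb(counts, sums, t):
    for u, c in enumerate(counts):
        if c == 0:
            return u
    return max(range(len(counts)),
               key=lambda u: sums[u] / counts[u] + math.sqrt(math.log(t) / counts[u]))

rng = random.Random(1)
p = [0.5] * 9 + [0.6]                     # 10 arms, one slightly better (toy gap)
for name, pol in [("uniform", uniform), ("ucb", ucb)]:
    avg = sum(final_simple_regret(pol, p, 500, rng) for _ in range(200)) / 200
    print(name, avg)
```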
  19. Outline: Parameter tuning; Bandits; Our testbed; Results.
  20. Our testbed: Monte-Carlo Tree Search. Why should you care about Monte-Carlo Tree Search? It is a recent algorithm (2006), very impressive in computer Go, and now invading discrete-time control, games, and difficult planning (high-dimensional cases!).
  21. Our testbed: Monte-Carlo Tree Search. A take-home message here: MCTS is not yet very well known, but it is a really great algorithm, in my humble opinion.
  22. Go: from 29 to 6 stones. 1998: loss against an amateur (6d), 19x19, H29. 2008: win against a pro (8p), 19x19, H9, MoGo. 2008: win against a pro (4p), 19x19, H8, CrazyStone. 2008: win against a pro (4p), 19x19, H7, CrazyStone. 2009: win against a pro (9p), 19x19, H7, MoGo. 2009: win against a pro (1p), 19x19, H6, MoGo. 2007: win against a pro (5p), 9x9 (blitz), MoGo. 2008: win against a pro (5p), 9x9 as white, MoGo. 2009: win against a pro (5p), 9x9 as black, MoGo. 2009: win against a pro (9p), 9x9 as white, Fuego. 2009: win against a pro (9p), 9x9 as black, MoGoTW. ==> still 6 stones at least!
  23. Go: from 29 to 6 stones. All of these good results were obtained with MCTS.
  24. UCT (Upper Confidence Trees). Coulom (06); Chaslot, Saito & Bouzy (06); Kocsis & Szepesvari (06).
  25. UCT (slides 25–28 are figure-only slides; no text beyond the title “UCT” in the transcript).
  29. UCT, Kocsis & Szepesvari (06).
  30. Exploitation ...
  31. Exploitation ... SCORE = 5/7 + k·sqrt( log(10)/7 )
  34. ... or exploration? SCORE = 0/2 + k·sqrt( log(10)/2 )
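(The two scores of slides 31 and 34 can be checked directly; which move is selected depends on the exploration constant k, whose value is not given on the slides.)

```python
import math

def uct_score(wins, visits, parent_visits, k):
    """UCT score of a child node: exploitation term + k * exploration term,
    as on the slides."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

k = 1.0  # exploration constant, chosen arbitrarily here
print(uct_score(5, 7, 10, k))   # the "exploitation" move: 5/7 + k*sqrt(log(10)/7)
print(uct_score(0, 2, 10, k))   # the "exploration" move:  0/2 + k*sqrt(log(10)/2)
# With a larger k the rarely tried move overtakes the well-explored one.
```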
  35. Monte-Carlo Tree Search: parallelization (multi-core, message passing, tentative GPGPU); frugality + consistency; automatic parameter tuning (≃ expensive optimization, DPS); combination of expert rules + supervised learning; learning-based Monte-Carlo patterns (non-regression GP); applications far from games (unstructured problems); active learning + non-linear optimization (Spiral); starting: energy + robotics.
  36. Outline: Parameter tuning; Bandits; Our testbed; Results.
  37. A bit of theory: Bonferroni statistical confidence bounds. A statistical test is as follows: I run n tests; I average the results; I choose a risk level δ; a little calculus (statistical test) says that I have precision ε.
  38. A bit of theory: Bonferroni statistical confidence bounds. If there are K tests, my probability of being wrong is multiplied by K!
  39. A bit of theory: Bonferroni statistical confidence bounds. A statistical test with Bonferroni correction is as follows: I run n tests for each of the K cases; I average the results; I choose a risk level δ/K; a little calculus (statistical test) says that I have precision ε.
  40. A bit of theory: Bonferroni statistical confidence bounds. If the answer is “yes” (the Bonferroni-corrected precision is good enough), uniform sampling should be ok; otherwise try UCB (and be lucky).
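(The slides do not show the precision formula itself; the sketch below uses a standard Hoeffding-style bound as a stand-in, together with the δ/K Bonferroni correction of slide 39. The budget, K and risk level are illustrative numbers, not taken from the talk.)

```python
import math

def precision(n, delta):
    """Half-width of a confidence interval on the mean of n trials in [0, 1]
    at risk level delta.  The slides only say that "a little calculus" gives
    a precision; a standard Hoeffding bound is used here as a stand-in."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def bonferroni_precision(n_per_case, delta, K):
    """Same interval when K candidate settings are tested simultaneously:
    the risk level is split as delta / K (Bonferroni correction)."""
    return precision(n_per_case, delta / K)

# Illustrative numbers (not from the slides): a budget of 10000 games split
# uniformly over K = 20 candidate parameter settings, overall risk 5%.
K, budget, delta = 20, 10000, 0.05
eps = bonferroni_precision(budget // K, delta, K)
print(eps)
# Paraphrasing the rule: if eps is small compared with the win-rate differences
# you need to detect, the answer is "yes" and uniform sampling should be ok;
# otherwise try UCB (and be lucky).
```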
  41. Results. We've been lucky: one of our test cases gives “yes”, the other gives “no” (we did not even have to cheat for this). First case: blitz games, nearly “fast” experiments. Second case: real games, very expensive optimization.
  42. Results. In the first case (blitz games), the uniform sampling is really convincing.
  43. Results. In the second case (real games), the uniform sampling is only moderately convincing; yet, the empirical best was ok (a bit better with UCB than with uniform).
  44. Conclusion. What do we propose? - A simple mathematically derived rule for predicting whether we can trust uniform sampling: if the answer is “yes”, then parallelization is easy and statistical validation is naturally included (Bonferroni correction); if the answer is “no”, uniform sampling can't work, so maybe UCB or successive rejects, and be lucky. - Experiments on MCTS (subliminal message: MCTS is great). - A remark: take care of Bonferroni corrections; much better for the regression testing of non-deterministic codes/data.
