reinforcement learning for difficult settings


mash project technical presentation, February 28th


  1. SUBGOAL LEARNING, MACRO-ACTIONS, PARTIAL OBSERVATION, CLUSTERING OF FEATURES, and other stuff for difficult reinforcement learning settings.
     One off-topic slide, sorry :-) Yesterday I was particularly interested in the discussion around deep networks and convolutional networks (Yann LeCun et al.) for computer vision, thanks :-) Seemingly, computational power is a big part of computer vision, right?
  2. MASH WP6 – Goal Planning
     Controlling a 3D avatar or a robot arm:
     - without expert help
     - without model
     - without parallel runs
     - without knowing the target
     - with expensive runs
     - using existing human expertise, if any, in a way compliant with crowd-sourcing (the human does not know the platform)
  3. Category of problems
     ● MDP solving: you have access to the model
     ● Generative models:
       – Cases in which you can “undo”
       – Cases in which you cannot
     The hardest reinforcement learning setting you can find, with expensive sims
  4. Goals of the project
     – Adapting MCTS for such problems
     – Parallel model-free MCTS
     – Facilitating and testing crowd-sourcing
     – Other methods for such problems
  5. Outline
     ● What we have done
       – Extension to partially observable, expensive, “very” model-free problems
       – Experiments on other WPs' testbeds
       – Experiments on our testbeds
     ● Parallelization
     ● Conclusions
     ● Perspectives
  6. MCTS / UCT
     ● MCTS = UCT (nearly)
     ● Very good for high-dimensional problems with little expertise
     ● Requires many simulations
     ● Principle:
       ● Do simulations (plenty of them)
       ● Adaptive decisions: the first simulations use naive strategies, and the simulated strategy is improved online.
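The adaptive-decision principle on this slide can be illustrated with the UCB1 rule that UCT applies at each tree node. The following is only a minimal sketch on a toy two-armed bandit, not the project's code: the function names, the exploration constant `c` and the Bernoulli reward model are all assumptions made for illustration.

```python
import math
import random


def ucb1_select(counts, sums, c=math.sqrt(2)):
    """Pick the arm maximizing mean reward + c * sqrt(ln(N) / n).

    Arms that were never tried are selected first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        sums[arm] / counts[arm] + c * math.sqrt(math.log(total) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(probs, n_sims, seed=0):
    """Simulate n_sims pulls on hypothetical Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    sums = [0.0] * len(probs)
    for _ in range(n_sims):
        arm = ucb1_select(counts, sums)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Early pulls are spread almost uniformly (the "naive" phase); as statistics accumulate, pulls concentrate on the arm with the better empirical mean, which is the online improvement the slide refers to.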
  7. Change #1: Macro-Actions (MA)
     ● With low-level decisions, actions often have to be repeated to be meaningful
     ● Example with left-right:
       RRRRRR makes sense
       LLLLLLLL makes sense
       RLLLRRL makes no sense
     ● Automatically categorize actions: eventually stationary, opposite, cyclic; define MA
     ==> state of the art + automation
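To make the left-right example concrete, here is a small sketch of a macro-action as a repeated primitive action, together with an empirical test for "opposite" actions. The 1D world, the function names and the set of start states are hypothetical; the slides do not describe how the categorization is actually implemented.

```python
def step(position, action):
    """Hypothetical 1D world: 'R' moves +1, 'L' moves -1."""
    return position + (1 if action == "R" else -1)


def run_macro(position, action, repeat):
    """Apply the same primitive action `repeat` times (one macro-action)."""
    for _ in range(repeat):
        position = step(position, action)
    return position


def run_sequence(position, actions):
    """Apply an arbitrary sequence of primitive actions."""
    for action in actions:
        position = step(position, action)
    return position


def are_opposite(a, b, starts=range(-3, 4)):
    """Empirical check: doing `a` then `b` returns to every tested start state."""
    return all(step(step(s, a), b) == s for s in starts)
```

In this toy world `run_macro(0, "R", 6)` travels six steps, whereas the seven decisions in `run_sequence(0, "RLLLRRL")` mostly cancel out, which is why repeating an action is the more useful unit of decision.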
  8. Change #2: Clustering Features
     ● Many state variables are very similar
     ● Clustering:
       – Perform simulations
       – Group correlated features
     ● Strongly reduces the state space dimension
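One simple way to group correlated features from simulation logs is sketched below. It is an assumption-laden illustration, not the CluVo algorithm: it uses Pearson correlation, a greedy single pass, and a hypothetical `threshold` of 0.9, none of which come from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cluster_features(columns, threshold=0.9):
    """Greedy grouping of feature columns recorded during simulations.

    Each feature joins the first cluster whose representative (its first
    member) it correlates with, in absolute value, at least `threshold`;
    otherwise it starts a new cluster. Returns lists of column indices.
    """
    clusters = []
    for j, col in enumerate(columns):
        for cluster in clusters:
            if abs(pearson(columns[cluster[0]], col)) >= threshold:
                cluster.append(j)
                break
        else:
            clusters.append([j])
    return clusters
```

Keeping one representative per cluster is what shrinks the state description: four logged features that fall into two clusters leave only two variables to reason about.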
  9. Change #3: memory
     ● Partially observable problems require memory
     ● Tree of subgoals:
  10. Change #3: memory
     ● Partially observable problems require memory
     ● Tree of subgoals:
       – I choose this action
  11. Change #3: memory
     ● Partially observable problems require memory
     ● Tree of subgoals:
       – Each node contains a goal, i.e. features to be activated
  12. Change #3: memory
     ● Partially observable problems require memory
     ● Tree of subgoals:
       – Each node contains a goal, i.e. features to be activated
       – Decisions made by “voting”: MA correlated with expected transitions
  13. Change #3: memory
     ● Partially observable problems require memory
     ● Tree of subgoals:
       – Each node contains a goal, i.e. features to be activated
       – Decisions made by “voting”: MA correlated with expected transitions
       – MCTS is an extremely natural tool for building subgoals
  14. Summary CluVo + GMCTS: all in one slide
     1) Simulations, categorization of actions
     2) Building of macro-actions
     3) Clustering of features
     4) MCTS by simulations, correlations, voting:
        1) Node creation as in MCTS, but node = subgoal
        2) Simulations biased by rewards as in MCTS
        3) Goals → votes → MA → decisions
     Vote: actions which statistically activate the goal features, in the current state, are preferred
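The "goals → votes → MA → decisions" step can be sketched as follows. This is one possible reading of the slide, under loud assumptions: the `VoteTable` class and its methods are hypothetical, each goal feature casts a single vote for the macro-action with the highest empirical activation rate, and the conditioning on the current state mentioned on the slide is omitted for brevity.

```python
from collections import Counter, defaultdict


class VoteTable:
    """Empirical statistics of which features each macro-action activates."""

    def __init__(self):
        self.tries = Counter()            # macro-action -> times executed
        self.hits = defaultdict(Counter)  # macro-action -> feature -> activations

    def record(self, macro_action, activated_features):
        """Log one simulated execution and the features it activated."""
        self.tries[macro_action] += 1
        for feature in activated_features:
            self.hits[macro_action][feature] += 1

    def rate(self, macro_action, feature):
        """Fraction of executions of `macro_action` that activated `feature`."""
        n = self.tries[macro_action]
        return self.hits[macro_action][feature] / n if n else 0.0

    def decide(self, goal_features):
        """Each goal feature votes for its best macro-action; majority wins."""
        if not self.tries or not goal_features:
            raise ValueError("need recorded simulations and a non-empty goal")
        votes = Counter()
        for feature in goal_features:
            best = max(sorted(self.tries), key=lambda ma: self.rate(ma, feature))
            votes[best] += 1
        return votes.most_common(1)[0][0]
```

A node of the subgoal tree would then hold its goal features and call `decide` to pick the macro-action to execute; ties are broken by macro-action name here purely to keep the sketch deterministic.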
  15. Other developments
     ● Q-learning
     ● Fitted Q-iteration
     ● Direct Policy Search
     Main issue: representation (macro-actions, clusters of features).
     ==> Direct Policy Search
       ● also uses MA
       ● but needs a memory
     ==> GMCTS quite convenient / focusing simulations.
  16. Results of CluVo + GMCTS on other WPs' testcases
     ● Blue flag then red flag: ok
     ● Looks easy, but in a fully agnostic framework with thousands of variables it is not that easy.
     ● The same algorithm performed correctly on “catch as many flags as possible”.
     ● Combines many things from the state of the art:
       – Macro-actions
       – Subgoal learning
       – Clustering of features
       – MCTS / UCT
  17. All you can eat: DPS could do it, with MA (no memory needed)
  18. Blue Flag then Red Flag: clustering ok for 3 out of 12 runs; 8h learning; generalization
  19. Test on testcases from other WPs
     ● This has taken most of the manpower: easy problems, but in a very difficult setting
     ● No crowd-sourcing
     ● We have other testbeds with external developers (same platform)
  20. Results on the game of Go (~8 contributors)
     ● Automatic modifications of the bandit (moderate success, far less efficient than supervised learning on databases or expert handcrafting)
       ==> Maths for crowd-sourcing:
       – Automated regression testing by MSHT (good for crowd-sourcing)
       – Constraints on the way humans enter expertise, for preserving consistency
     ● Automatic precomputing of moves (opening books)
     ==> both are quite parallel, but very expensive for moderate progress
  21. Results on Urban Rivals (18 million players, ~15 developers on the core)
     ● Also partial observability, but information is fully revealed frequently
     ● Easy to simulate
     ● MCTS was great for this application:
       – No human expertise needed
       – Consistent independently of human expertise
     ● Solved the problem, whereas many engineers had failed
  22. MineSweeper – building a code on top of an existing heuristic (1 existing code...)
  23. Results on Energy management (~7/8 developers, high turnover)
     ● Existing solutions are often very poor for short-term volatility and high dimension of the state space
     ● Simulation-based approaches: rigorous use of cross-validation, detailed non-simplified simulations
     ● MCTS + DPS (for choosing the default policy): stable and efficient
  24. Results
  25. Conclusions (1)
     ● MCTS adapted to partially observable, expensive, agnostic settings
       – MCTS + all existing tricks from the state of the art
       – Integration into MCTS is probably more natural than in many algorithms (in particular subgoal learning)
       – Big implementation and experimentation work; more publications to come
     ● An unexpected positive result:
       ● Merge between two simulation-based tools, DPS (long-term effects) and MCTS (short-term effects)
       ● Quite natural, highly parallel
       ● Virtually no model bias
       ● Really efficient in stochastic settings (not adversarial)
  26. Conclusions (2)
     ● We tested on problems with external developers:
       ● Simulation-based optimization makes precise models possible
       ● Interface with humans: automatic non-regression testing / constraints for consistency / interface for using human knowledge ok
       ● WP problems did not motivate alternate developers, but the principles could be tested on other testcases
     ● No real crowd-sourcing, but some moderate-sized teams of motivated developers ==> easier
     ● Principles developed for the platform are reused for an industrial platform
  27. Perspectives
     ● The GMCTS / CluVo program is stable and able to work on very hard settings by combining many state-of-the-art techniques; it can have a long life; no crowd-sourcing
     ● The application of simulation-based methods is efficient, compliant with non-linear stochastic dynamics, and parallel ==> validated for industrialization in energy management
     ● MCTS variants for partially observable settings also work far from the hard MASH setting
  28. Publication
     ● Main MASH publication: JB Hoock's 2012 paper:
       ● Categorizing actions for automatically designing macro-actions
       ● Clustering of features
       ● GMCTS for building subgoals
     ● Many ideas in it; it is a big part of his Ph.D. in one article.
  29. Publications
     ● Undecidability of adversarial planning with unbounded horizon
     ● DPS: Convergence rates of robust optimization / noisy optimization
     ● Parallel MCTS / nested MCTS
     ● MCTS in continuous settings
     ● MCTS for PO setting (real-world: Urban Rivals)
     ● Model-free MCTS
     ● Hybridization MCTS/DPS
     ● Simulation-based optimization in power systems
     Planning with:
     - adversarial uncertainties
     - finite state space
     - no observation
     - deterministic problem
     Optimal average reward:
     - undecidable
     - unapproximable
  30. Publications
     ● Undecidability of adversarial planning with unbounded horizon
     ● DPS: Convergence rates of robust optimization / noisy optimization
     ● Parallel MCTS / nested MCTS
     ● MCTS in continuous settings
     ● MCTS for PO setting (real-world: Urban Rivals)
     ● Model-free MCTS
     ● Hybridization MCTS/DPS
     ● Simulation-based optimization in power systems
     Planning with:
     - adversarial uncertainties
     - finite state space
     - partial observation
     - stochastic problem
     - for all strategies, stops almost surely
     Optimal average reward is decidable
  31. Publications
     ● Undecidability of adversarial planning with unbounded horizon
     ● DPS: Convergence rates of robust optimization / noisy optimization
     ● Parallel MCTS / nested MCTS
     ● MCTS in continuous settings
     ● MCTS for PO setting (real-world: Urban Rivals)
     ● Model-free MCTS
     ● Hybridization MCTS/DPS
     ● Simulation-based optimization in power systems
     Optimal rates in the parallel case for robust optimization w.r.t. monotonic compositions ==> essentially, bounds and evolutionary patches for computation
  32. Publications
     ● Undecidability of adversarial planning with unbounded horizon
     ● DPS: Convergence rates of robust optimization / noisy optimization
     ● Parallel MCTS / nested MCTS
     ● MCTS in continuous settings
     ● MCTS for PO setting (real-world: Urban Rivals)
     ● Model-free MCTS
     ● Hybridization MCTS/DPS
     ● Simulation-based optimization in power systems
     Optimal rates for noisy quadratic optimization with black-box linear in the regret variance
  33. Publications
     ● Undecidability of adversarial planning with unbounded horizon
     ● DPS: Convergence rates of robust optimization / noisy optimization
     ● Parallel MCTS / nested MCTS
     ● MCTS in continuous settings
     ● MCTS for PO setting (real-world: Urban Rivals)
     ● Model-free MCTS
     ● Hybridization MCTS/DPS
     ● Simulation-based optimization in power systems
     Consistency proof in the continuous case
  34. Publications
     ● Undecidability of adversarial planning with unbounded horizon
     ● DPS: Convergence rates of robust optimization / noisy optimization
     ● Parallel MCTS / nested MCTS
     ● MCTS in continuous settings
     ● MCTS for PO setting (real-world: Urban Rivals)
     ● Model-free MCTS
     ● Hybridization MCTS/DPS
     ● Simulation-based optimization in power systems
