Simulations for combining
heuristics and consistent
algorithms

Applications to Minesweeper,
the game of Go
and Power Grids
O. Buffet, A. Couëtoux, H. Doghmen,
W. Lin, O. Teytaud,
& many others
In France   (RTE+Luc Lasne)
Beautiful spatially distributed problem
●   Short term (~10s): dispatching
    ●   Real-time control, humans in the loop
●   Days, weeks: combinatorial optimization
●   Years: hydroelectric stocks
    ●   Stochasticity (provides the price of water for week-ahead decisions)
●   50 years: investments
    ●   Expensive optimization of strategies (parallel)
    ●   Uncertainties: multi-objective (many!) ? Worst-case ?
Goals / tools
●   Optimizing investments
       –    For the next 50 years
       –    In Europe / North Africa
●   Taking into account power plants / networks
●   Multi-objective (scenarios), visualization


    with:
●   Collaboration with a company (data, models)
●   3 PhD students full time
●   A dedicated machine with 500/1000 cores
Main ideas
●   We like simulation-based optimizers
    ●   For analyzing simulations
    ●   For bilevel optimization (“anytime” criterion: smooth
        performance improvement)
●   Required knowledge is a simulator:
    ●   No dependency on additional knowledge
    ●   Simplified model (linearized...) not necessary
●   But we want to be able to plug expertise in
    terms of strategy (e.g. handcrafted approximate
    policy)

    Tools:
●   Upper Confidence Tree = Adaptive Simulator, good for
    combinatorial aspects
●   Direct Policy Search = Adaptive Simulator, good for long term
    effects
A great challenge: MineSweeper.

- looks easy
- in fact, not easy: many approaches are myopic (one-step-ahead)
- partially observable
1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
RULES

At the beginning, all locations are covered (unknown).
I play here!
Good news! No mine in the neighborhood!

I can “click” all the neighbours.
I have 3 uncovered neighbors, and 3 mines in the neighborhood ==> 3 flags!
I know it's a mine, so I put a flag!
No info!
I play here and I lose...
The most successful game ever!

Who in this room has never played MineSweeper?
1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
Do you think it's easy? (10 mines)

MineSweeper is not simple.
What is the optimal move?
What is the optimal move?

Remark: the question makes sense without knowing the history.
You don't need the history for playing optimally.
==> (this fact is mathematically non-trivial!)
What is the optimal move?

This one is easy.

Both remaining locations win with probability 50%.
More difficult! Which move is optimal?

Here, the classical approach (CSP) is wrong.
Probability of a mine?
- Top: 33%
- Middle: 33%
- Bottom: 33%

==> so all moves equivalent?
==> NOOOOO!!!

Top or bottom: 66% of win!
Middle: 33%!
The myopic (one-step-ahead) approach plays randomly.

The middle is a bad move!

Even with the same probability of a mine, some moves are better than others!
State of the art:
- solved in 4x4
- NP-complete
- Constraint Satisfaction Problem approach:
    = Find the location which is least likely to be a mine, play there.
  ==> 80% success “beginner” (9x9, 10 mines)
  ==> 45% success “intermediate” (16x16, 40 mines)
  ==> 34% success “expert” (30x40, 99 mines)
1. Rules of MineSweeper
2. State of the art
3. The CSP approach (and other well-known older methods)
4. The UCT approach
5. The best of both worlds
- Exact MDP: very expensive. 4x4 solved.
- Single Point Strategy (SPS): simple local solving
- CSP (constraint satisfaction problem): the main approach.
    - (unknown) state:
          x(i) = 1 if there is a mine at location i
    - each visible location is a constraint:
          If location 15 is labelled 4, then the constraint is
          x(04)+x(05)+x(06)
         +x(14)        +x(16)
         +x(24)+x(25)+x(26) = 4.
    - find all solutions x1, x2, x3, ..., xN
    - P(mine in j) = (sum_i x_i(j)) / N  <== this is mathematically proved!
    - play j such that P(mine in j) is minimal
    - if several such j, break ties randomly.

MDP = Markov Decision Process
CSP = Constraint Satisfaction Problem
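The enumeration above can be sketched in a few lines of Python (a brute-force illustration on a flat list of covered cells; the function name and board encoding are mine, not from the talk):

```python
from itertools import combinations

def mine_probabilities(n_cells, n_mines, constraints):
    """Enumerate every placement of n_mines among n_cells (the covered
    locations), keep the placements satisfying all constraints, and
    return, for each cell, the fraction of consistent placements that
    put a mine there. Each constraint is (covered_neighbor_ids, label)."""
    counts = [0] * n_cells
    n_solutions = 0
    for mines in combinations(range(n_cells), n_mines):
        placed = set(mines)
        if all(sum(i in placed for i in nbrs) == k for nbrs, k in constraints):
            n_solutions += 1
            for i in placed:
                counts[i] += 1
    return [c / n_solutions for c in counts]

# The three-cells situation from the slides: 3 covered cells, 1 mine,
# one revealed "1" touching all of them ==> 33% everywhere.
probs = mine_probabilities(3, 1, [((0, 1, 2), 1)])
```

Note that this is exactly why CSP alone is myopic: it only returns per-cell probabilities, and says nothing about the value of the information a move reveals.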
CSP as modified by Legendre et al., 2012:

   - same (unknown) state, same constraints, same enumeration of the
     solutions x1, x2, x3, ..., xN as above
   - P(mine in j) = (sum_i x_i(j)) / N  <== this is mathematically proved!
   - play j such that P(mine in j) is minimal
   - if several such j, choose one “closest to the frontier”
                        (proposed by Legendre et al.)
   - if several such j remain, break ties randomly.
CSP
- is very fast
- but it's not optimal
- because of situations like the three-33%-cells example above:
  here CSP plays randomly!

Also for the initial move: don't play the first move randomly!
(sometimes an opening book is used)
1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
The MCTS approach

●   Random simulations (Brügmann, 93)

●   A tree of possible futures, growing along the simulations
        (several simultaneous papers, 2006)
Why not UCT?
- looks like a stupid idea at first view
- cannot compete with CSP in terms of speed
- But at least UCT is consistent: given sufficient time,
  it will play optimally.
- Tested in Couëtoux and Teytaud, 2011
UCT (Upper Confidence Trees)

Coulom (06)
Chaslot, Saito & Bouzy (06)
Kocsis & Szepesvári (06)
Exploitation ...
            SCORE =
                5/7
             + k.sqrt( log(10)/7 )

... or exploration ?
              SCORE =
                  0/2
               + k.sqrt( log(10)/2 )
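The score on these slides is the classical UCB1 formula; as a minimal sketch (the function name and the default constant k are mine):

```python
import math

def ucb_score(wins, visits, parent_visits, k=math.sqrt(2)):
    """Exploitation term (empirical success rate) plus an exploration
    bonus that grows with the parent's visit count and shrinks with
    the node's own visit count."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

# The two candidate nodes from the slides (the parent was simulated 10 times):
exploited = ucb_score(5, 7, 10)   # 5/7 + k.sqrt( log(10)/7 )
explored = ucb_score(0, 2, 10)    # 0/2 + k.sqrt( log(10)/2 )
```

UCT descends into the child with the larger score, so a rarely tried move can still be selected despite a poor empirical mean.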
UCT in one slide




Great progress in the game of Go and in various other games
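The whole loop really does fit in one slide; here is a generic, simplified sketch of UCT (an illustration, not the authors' implementation; all names and the toy usage are mine):

```python
import math
import random

def uct_search(root, n_sims, legal, step, terminal, k=1.4):
    """Generic UCT: repeat (select by UCB score, expand on first visit,
    simulate with a random default policy, back up the reward), then
    return the most-visited action at the root.
    `legal(s)` lists actions, `step(s, a)` samples (next_state, reward)
    from a generative model, `terminal(s)` tests for the end of the game."""
    visits, value, expanded = {}, {}, set()

    def simulate(s):
        if terminal(s):
            return 0.0
        acts = legal(s)
        if s not in expanded:            # expansion + random default move
            expanded.add(s)
            a = random.choice(acts)
        else:                            # selection by UCB score
            n_parent = sum(visits.get((s, b), 0) for b in acts)
            def score(b):
                n = visits.get((s, b), 0)
                if n == 0:
                    return float("inf")
                return value[(s, b)] / n + k * math.sqrt(math.log(n_parent) / n)
            a = max(acts, key=score)
        s2, r = step(s, a)
        total = r + simulate(s2)         # accumulate reward down the path
        visits[(s, a)] = visits.get((s, a), 0) + 1
        value[(s, a)] = value.get((s, a), 0.0) + total
        return total

    for _ in range(n_sims):
        simulate(root)
    return max(legal(root), key=lambda a: visits.get((root, a), 0))

# Toy usage: a two-armed bandit seen as a one-step game.
random.seed(0)
legal = lambda s: ["good", "bad"]
terminal = lambda s: s == "end"
step = lambda s, a: ("end", 1.0 if random.random() < (0.8 if a == "good" else 0.2) else 0.0)
best = uct_search("root", 400, legal, step, terminal)
```

Consistency comes from the exploration term: every action keeps being tried, so with enough simulations the visit counts concentrate on the optimal move.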
UCT in one slide

    We use the CSP by Legendre et al., 2012,
    for the expansion and simulation steps.
Applying UCT here?
•   Might look like “too much”
•   But in many cases CSP is suboptimal
•   We have seen an example of a suboptimal CSP move a few slides ago
•   Let's see two additional examples
An example showing that the initial move matters
(UCT finds it, not CSP):

3x3, 7 mines: the optimal move is anything but the center.
Optimal winning rate: 25%.
Optimal winning rate with a uniformly random initial move: 17/72.

(yes, we get a 1/72 improvement!)
Second such example: 15 mines on a 5x5 board with the GnoMine rule
(i.e. the initial move is a 0, i.e. no mine in the neighborhood).

Optimal success rate = 100%!!!!!
Play the center, and you win (well, you have to work...).
The myopic CSP approach does not find it.
1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
Summary
    I have two approaches:
●   CSP:
       ●   Fast
       ●   Suboptimal (myopic, only 1-step ahead)
●   UCT:
       ●   Needs a generative model (probability of next
           states, given my action)
       ●   Asymptotically optimal
The best of both worlds?

●   CSP:
       ●   Fast
       ●   Suboptimal (myopic, only 1-step ahead)
●   UCT:
       ●   Needs a generative model, provided here by CSP
       ●   Asymptotically optimal
What do I need for implementing UCT?

A complete generative model.
Given a state and an action,
I must be able to simulate possible transitions.

State S, Action a:
(S,a) ==> S'

Example: given the state below, and the action “top left”,
what are the possible next states?
We published a version of UCT for MineSweeper in which
this generative model was implemented using the rejection method only.
Rejection algorithm:
      1- randomly draw the mines
      2- if the draw is consistent with all observations, return the new observation
      3- otherwise, go back to 1.
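The rejection method can be sketched as follows, on a flat list of covered cells (the encoding and names are mine):

```python
import random

def draw_consistent_board(n_cells, n_mines, constraints, rng=random):
    """Rejection method from the slide: draw a uniformly random placement
    of the mines, accept it only if every observation-constraint holds,
    otherwise redraw. Each constraint is (covered_neighbor_ids, label).
    Mathematically exact, but the acceptance rate collapses as the
    constraints accumulate, which is why it ends up being too slow."""
    while True:
        placed = set(rng.sample(range(n_cells), n_mines))
        if all(sum(i in placed for i in nbrs) == k for nbrs, k in constraints):
            return placed
```

Each accepted draw is one possible next state for the UCT simulation.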
It is mathematically OK, but it is too slow.

Then, we used a weak CSP implementation. Still too slow.

Now we have a reasonably fast implementation, with the Legendre et al. heuristic.
EXPERIMENTAL RESULTS

(results table: 10 000 UCT simulations per move; huge computation time;
our results took a few days in total)
CONCLUSIONS: a methodology for sequential decision making

- When you have a myopic solver
  (i.e. one which neglects long-term
  effects, as too often in industry!)
     ==> improve it with heuristics (as
            Legendre et al. did)
     ==> combine it with UCT (as we did)
     ==> significant improvements

- We have similar experiments on
   industrial testbeds
Main ideas
●   We like simulation-based optimizers
    ●   For analyzing simulations
    ●   For bilevel optimization (“anytime” criterion: smooth performance improvement)
●   All we want as required knowledge is a simulator:
    ●   No dependency on additional knowledge
    ●   No simplified model
●   But we want to be able to plug expertise in terms of strategy (e.g.
    handcrafted approximate policy)

    Tools:
●   Upper Confidence Tree = Adaptive Simulator, good for
    combinatorial aspects
●   Direct Policy Search = Adaptive Simulator, good for long term
    effects
What is Direct Policy Search?
●   I have a parametric policy decision = p(w,s)
        –   Inputs: parameter w, state s
        –   Output: decision p(w,s)
        –   E.g.
              ●    p(w,s) = w.s (scalar product)
              ●    p(w,s) = W0 + W1 x tanh(W2 x s + W3)            (neural network)
●   I have a simulator cost = simulate(p, transition):
        –   Inputs = policy p, transition function
        –   Output = cost (possibly noisy)
        –   Principle:
                     –   state = initial state
                     –   While (state not final)
                            ●   decision = p(state)
                            ●   state = transition(state, decision)

    (A big part of our work.)

●   Direct Policy Search( transition, policy p(.,.) ):
        –   w = argmin simulate( p(w,.) )
                        // with your favorite optimization algorithm
        –   return p(w,.)
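A minimal sketch of the pseudo-code above (random local search stands in for “your favorite optimization algorithm”, and the toy control problem and all names are mine):

```python
import random

def simulate(policy, transition, initial_state, horizon=50):
    """Roll the policy forward through the transition model,
    accumulating the (possibly noisy) cost."""
    state, cost = initial_state, 0.0
    for _ in range(horizon):
        decision = policy(state)
        state, step_cost = transition(state, decision)
        cost += step_cost
    return cost

def direct_policy_search(transition, p, w0, initial_state, n_iters=300):
    """w = argmin simulate(p(w, .)), here by simple random local search:
    perturb the best parameter vector found so far and keep improvements."""
    best_w = list(w0)
    best_cost = simulate(lambda s: p(best_w, s), transition, initial_state)
    for _ in range(n_iters):
        cand = [wi + random.gauss(0.0, 0.3) for wi in best_w]
        cost = simulate(lambda s: p(cand, s), transition, initial_state)
        if cost < best_cost:
            best_w, best_cost = cand, cost
    return best_w

# Toy control problem: drive a scalar state toward 0 with a linear
# policy d = w*s; each step costs the squared distance to 0.
p = lambda w, s: w[0] * s
transition = lambda s, d: (s + d, s * s)
```

The interface is the point: all DPS needs is the simulator, and any black-box optimizer (in the talk's setting, a serious parallel one) can be plugged in.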
Example of policy
    p(w,state) = decision     such that
●   Alpha = W0 + W1.tanh(W2.state) + W3
●   Week-Ahead-Reward(decision) + Alpha.Stock(decision) is maximum

==> if the transition is linear, this is compliant with a huge
         action dimension
==> non-linearities are handled by the neural net
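A sketch of this policy (scalar state and stock for simplicity; the names and candidate set are mine, and the inner argmax over a small candidate list stands in for the linear program one would solve with linear transitions and a huge action dimension):

```python
import math

def alpha(w, state):
    """Neural 'price of water' from the slide:
    Alpha = W0 + W1*tanh(W2*state) + W3, with w = (W0, W1, W2, W3)."""
    w0, w1, w2, w3 = w
    return w0 + w1 * math.tanh(w2 * state) + w3

def policy(w, state, candidates, week_ahead_reward, stock):
    """Choose the decision maximizing short-term reward plus Alpha times
    the resulting stock: Alpha converts stored water into money."""
    a = alpha(w, state)
    return max(candidates, key=lambda d: week_ahead_reward(d) + a * stock(d))
```

DPS then only has to tune the four scalars in w: a low Alpha makes the policy release water for immediate reward, a high Alpha makes it hoard the stock.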
One or two rivers (7 stocks)
Network of rivers (7 rivers)
Investment problem
Summary

    Two simulation-based tools for Sequential Decision Making:
●   UCT = a nice tool for short-term combinatorial effects
●   DPS = a stable tool for long-term effects

    Both:
●   Are anytime
●   Provide simulation results
●   Can take into account non-linear effects
●   Can handle a high-dimensional state space
Thanks for your attention!

    9 mines. What is the optimal move?

More Related Content

Viewers also liked

Bias and Variance in Continuous EDA: massively parallel continuous optimization
Bias and Variance in Continuous EDA: massively parallel continuous optimizationBias and Variance in Continuous EDA: massively parallel continuous optimization
Bias and Variance in Continuous EDA: massively parallel continuous optimizationOlivier Teytaud
 
Artificial intelligence for power systems
Artificial intelligence for power systemsArtificial intelligence for power systems
Artificial intelligence for power systemsOlivier Teytaud
 
Examples of operational research
Examples of operational researchExamples of operational research
Examples of operational researchOlivier Teytaud
 
Monte Carlo Tree Search in 2014 (MCMC days in Marseille)
Monte Carlo Tree Search in 2014 (MCMC days in Marseille)Monte Carlo Tree Search in 2014 (MCMC days in Marseille)
Monte Carlo Tree Search in 2014 (MCMC days in Marseille)Olivier Teytaud
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationOlivier Teytaud
 
Réseaux neuronaux profonds & intelligence artificielle
Réseaux neuronaux profonds & intelligence artificielleRéseaux neuronaux profonds & intelligence artificielle
Réseaux neuronaux profonds & intelligence artificielleOlivier Teytaud
 
Disappointing results & open problems in Monte-Carlo Tree Search
Disappointing results & open problems in Monte-Carlo Tree SearchDisappointing results & open problems in Monte-Carlo Tree Search
Disappointing results & open problems in Monte-Carlo Tree SearchOlivier Teytaud
 

Viewers also liked (11)

Bias and Variance in Continuous EDA: massively parallel continuous optimization
Bias and Variance in Continuous EDA: massively parallel continuous optimizationBias and Variance in Continuous EDA: massively parallel continuous optimization
Bias and Variance in Continuous EDA: massively parallel continuous optimization
 
Artificial intelligence for power systems
Artificial intelligence for power systemsArtificial intelligence for power systems
Artificial intelligence for power systems
 
Examples of operational research
Examples of operational researchExamples of operational research
Examples of operational research
 
Direct policy search
Direct policy searchDirect policy search
Direct policy search
 
Monte Carlo Tree Search in 2014 (MCMC days in Marseille)
Monte Carlo Tree Search in 2014 (MCMC days in Marseille)Monte Carlo Tree Search in 2014 (MCMC days in Marseille)
Monte Carlo Tree Search in 2014 (MCMC days in Marseille)
 
Functional programming
Functional programmingFunctional programming
Functional programming
 
Power systemsilablri
Power systemsilablriPower systemsilablri
Power systemsilablri
 
Simple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimizationSimple regret bandit algorithms for unstructured noisy optimization
Simple regret bandit algorithms for unstructured noisy optimization
 
Debugging
DebuggingDebugging
Debugging
 
Réseaux neuronaux profonds & intelligence artificielle
Réseaux neuronaux profonds & intelligence artificielleRéseaux neuronaux profonds & intelligence artificielle
Réseaux neuronaux profonds & intelligence artificielle
 
Disappointing results & open problems in Monte-Carlo Tree Search
Disappointing results & open problems in Monte-Carlo Tree SearchDisappointing results & open problems in Monte-Carlo Tree Search
Disappointing results & open problems in Monte-Carlo Tree Search
 

Similar to Simulation-based optimization: Upper Confidence Tree and Direct Policy Search

Optimization of power systems - old and new tools
Optimization of power systems - old and new toolsOptimization of power systems - old and new tools
Optimization of power systems - old and new toolsOlivier Teytaud
 
Tools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power SystemsTools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power SystemsOlivier Teytaud
 
Teaching Constraint Programming, Patrick Prosser
Teaching Constraint Programming,  Patrick ProsserTeaching Constraint Programming,  Patrick Prosser
Teaching Constraint Programming, Patrick ProsserPierre Schaus
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine LearningPranav Ainavolu
 
2021 1학기 정기 세미나 2주차
2021 1학기 정기 세미나 2주차2021 1학기 정기 세미나 2주차
2021 1학기 정기 세미나 2주차Moonki Choi
 
AlphaZero and beyond: Polygames
AlphaZero and beyond: PolygamesAlphaZero and beyond: Polygames
AlphaZero and beyond: PolygamesOlivier Teytaud
 
Playing Go with Clojure
Playing Go with ClojurePlaying Go with Clojure
Playing Go with Clojureztellman
 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowYi-Fan Liou
 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfssuseradaf5f
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Silverdisappointing8 120924091642-phpapp01
Silverdisappointing8 120924091642-phpapp01Silverdisappointing8 120924091642-phpapp01
Silverdisappointing8 120924091642-phpapp01David Robles
 
Practical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit AlgorithmsPractical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit AlgorithmsSC5.io
 
Heuristic approach optimization
Heuristic  approach optimizationHeuristic  approach optimization
Heuristic approach optimizationAng Sovann
 
constructing_generic_algorithms__ben_deane__cppcon_2020.pdf
constructing_generic_algorithms__ben_deane__cppcon_2020.pdfconstructing_generic_algorithms__ben_deane__cppcon_2020.pdf
constructing_generic_algorithms__ben_deane__cppcon_2020.pdfSayanSamanta39
 
Meta Monte-Carlo Tree Search
Meta Monte-Carlo Tree SearchMeta Monte-Carlo Tree Search
Meta Monte-Carlo Tree SearchOlivier Teytaud
 
Scott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFMLconf
 

Similar to Simulation-based optimization: Upper Confidence Tree and Direct Policy Search (20)

Optimization of power systems - old and new tools
Optimization of power systems - old and new toolsOptimization of power systems - old and new tools
Optimization of power systems - old and new tools
 
Tools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power SystemsTools for Discrete Time Control; Application to Power Systems
Tools for Discrete Time Control; Application to Power Systems
 
Teaching Constraint Programming, Patrick Prosser
Teaching Constraint Programming,  Patrick ProsserTeaching Constraint Programming,  Patrick Prosser
Teaching Constraint Programming, Patrick Prosser
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine Learning
 
2021 1학기 정기 세미나 2주차
2021 1학기 정기 세미나 2주차2021 1학기 정기 세미나 2주차
2021 1학기 정기 세미나 2주차
 
Ucb
UcbUcb
Ucb
 
AlphaZero and beyond: Polygames
AlphaZero and beyond: PolygamesAlphaZero and beyond: Polygames
AlphaZero and beyond: Polygames
 
Playing Go with Clojure
Playing Go with ClojurePlaying Go with Clojure
Playing Go with Clojure
 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with Tensorflow
 
anintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdfanintroductiontoreinforcementlearning-180912151720.pdf
anintroductiontoreinforcementlearning-180912151720.pdf
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Silverdisappointing8 120924091642-phpapp01
Silverdisappointing8 120924091642-phpapp01Silverdisappointing8 120924091642-phpapp01
Silverdisappointing8 120924091642-phpapp01
 
Practical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit AlgorithmsPractical AI for Business: Bandit Algorithms
Practical AI for Business: Bandit Algorithms
 
Heuristic approach optimization
Heuristic  approach optimizationHeuristic  approach optimization
Heuristic approach optimization
 
constructing_generic_algorithms__ben_deane__cppcon_2020.pdf
constructing_generic_algorithms__ben_deane__cppcon_2020.pdfconstructing_generic_algorithms__ben_deane__cppcon_2020.pdf
constructing_generic_algorithms__ben_deane__cppcon_2020.pdf
 
Games
GamesGames
Games
 
Meta Monte-Carlo Tree Search
Meta Monte-Carlo Tree SearchMeta Monte-Carlo Tree Search
Meta Monte-Carlo Tree Search
 
Scott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SF
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 

Recently uploaded

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 

Recently uploaded (20)

Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf

Simulation-based optimization: Upper Confidence Tree and Direct Policy Search

  • 1. Simulations for combining heuristics and consistent algorithms. Applications to Minesweeper, the game of Go and Power Grids. O. Buffet, A. Couëtoux, H. Doghmen, W. Lin, O. Teytaud, & many others
  • 2. In France (RTE+Luc Lasne)
  • 3. Beautiful spatially distributed problem ● Short term (~10s): dispatching ● Real-time control, humans in the loop ● Days, weeks: combinatorial optimization ● Years: hydroelectric stocks ● Stochasticity (provides the price of water for week-ahead decisions) ● 50 years: investments ● Expensive optimization of strategies (parallel) ● Uncertainties: multiobjective (many!) ? Worst-case ?
  • 4. Goals / tools ● Optimizing investments – For the next 50 years – In Europe / North Africa ● Taking into account power plants / networks ● Multi-objective (scenarios), visualization with: ● Collaboration with a company (data, models) ● 3 Ph.D. students full time ● A dedicated machine with 500/1000 cores
  • 5. Main ideas ● We like simulation-based optimizers ● For analyzing simulations ● For bilevel optimization (“anytime” criterion: smooth performance improvement) ● Required knowledge is a simulator: ● No dependency on additional knowledge ● Simplified model (linearized...) not necessary ● But we want to be able to plug expertise in terms of strategy (e.g. handcrafted approximate policy)
  • 6. Main ideas ● We like simulation-based optimizers ● For analyzing simulations ● For bilevel optimization (“anytime” criterion: smooth performance improvement) ● All we want as required knowledge is a simulator: ● No dependency on additional knowledge ● No simplified model ● But we want to be able to plug expertise in terms of strategy (e.g. handcrafted approximate policy) Tools: ● Upper Confidence Tree = Adaptive Simulator, good for combinatorial aspects ● Direct Policy Search = Adaptive Simulator, good for long term effects
  • 7. A great challenge: MineSweeper. - looks easy - in fact, not easy: many myopic (one-step-ahead) approaches - partially observable
  • 8. 1. Rules of MineSweeper 2. State of the art 3. The CSP approach 4. The UCT approach 5. The best of both worlds
  • 9. RULES At the beginning, all locations are covered (unknown).
  • 11. Good news! No mine in the neighborhood! I can “click” all the neighbours.
  • 12. I have 3 covered neighbors, and I have 3 mines in the neighborhood ==> 3 flags!
  • 13.
  • 14. I know it's a mine, so I put a flag!
  • 16. I play here and I lose...
  • 17. The most successful game ever! Who in this room never played MineSweeper ?
  • 18. 1. Rules of MineSweeper 2. State of the art 3. The CSP approach 4. The UCT approach 5. The best of both worlds
  • 19. Do you think it's easy ? (10 mines) MineSweeper is not simple.
  • 21. What is the optimal move ? Remark: the question makes sense without knowing the history. You don't need the history to play optimally. ==> (this fact is mathematically non-trivial!)
  • 22. What is the optimal move ? This one is easy. Both remaining locations win with proba 50%.
  • 23. More difficult! Which move is optimal ? Here, the classical approach (CSP) is wrong.
  • 24. Probability of a mine ? - Top: - Middle: - Bottom:
  • 25. Probability of a mine ? - Top: 33% - Middle: - Bottom:
  • 26. Probability of a mine ? - Top: 33% - Middle: 33% - Bottom:
  • 27. Probability of a mine ? - Top: 33% - Middle: 33% - Bottom: 33%
  • 28. Probability of a mine ? - Top: 33% - Middle: 33% - Bottom: 33% ==> so all moves equivalent ?
  • 29. Probability of a mine ? - Top: 33% - Middle: 33% - Bottom: 33% ==> so all moves equivalent ? ==> NOOOOO!!!
  • 30. Probability of a mine ? - Top: 33% - Middle: 33% - Bottom: 33% Top or bottom: 66% of win! Middle: 33%!
  • 31. The myopic (one-step ahead) approach plays randomly. The middle is a bad move! Even with same proba of mine, some moves are better than others!
  • 32. State of the art: - solved in 4x4 - NP-complete - Constraint Satisfaction Problem approach: = Find the location which is least likely to be a mine, play there. ==> 80% success “beginner” (9x9, 10 mines) ==> 45% success “intermediate” (16x16, 40 mines) ==> 34% success “expert” (30x40, 99 mines)
  • 33. 1. Rules of MineSweeper 2. State of the art 3. The CSP approach (and other old known methods) 4. The UCT approach 5. The best of both worlds
  • 34. - Exact MDP: very expensive; 4x4 solved. - Single Point Strategy (SPS): simple local solving - CSP (constraint satisfaction problem): the main approach. - (unknown) state: x(i) = 1 if there is a mine at location i - each visible location is a constraint: if location 15 is labelled 4, then the constraint is x(04)+x(05)+x(06)+x(14)+x(16)+x(24)+x(25)+x(26) = 4. - find all solutions x1, x2, x3, ..., xN - P(mine in j) = (sum_i x_i(j)) / N <== this is mathematically proven! - play j such that P(mine in j) is minimal - if several such j, randomly break ties. MDP = Markov Decision Process, CSP = Constraint Satisfaction Problem
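The CSP probability computation on this slide can be sketched by brute-force enumeration (feasible only for small frontiers). This is a minimal sketch, not the authors' implementation; the constraint encoding, function name, and toy board below are illustrative only:

```python
from itertools import product

def mine_probabilities(n_cells, constraints, n_mines=None):
    """Enumerate all assignments x in {0,1}^n_cells satisfying every
    constraint (cells, count): sum(x[i] for i in cells) == count.
    Returns P(mine at j) = fraction of solutions with a mine at j."""
    solutions = []
    for x in product((0, 1), repeat=n_cells):
        if n_mines is not None and sum(x) != n_mines:
            continue  # optionally enforce the total number of mines
        if all(sum(x[i] for i in cells) == count for cells, count in constraints):
            solutions.append(x)
    n = len(solutions)
    return [sum(x[j] for x in solutions) / n for j in range(n_cells)]

# Toy frontier: 3 covered cells; an uncovered "1" touching cells 0 and 1,
# and an uncovered "1" touching cells 1 and 2.
probs = mine_probabilities(3, [((0, 1), 1), ((1, 2), 1)])
```

On this toy frontier the two constraints admit exactly two solutions, (1,0,1) and (0,1,0), so every covered cell has mine probability 1/2 — precisely the kind of tie on which plain CSP must break ties at random.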
  • 35. CSP as modified by Legendre et al., 2012: - (unknown) state: x(i) = 1 if there is a mine at location i - each visible location is a constraint: if location 15 is labelled 4, then the constraint is x(04)+x(05)+x(06)+x(14)+x(16)+x(24)+x(25)+x(26) = 4. - find all solutions x1, x2, x3, ..., xN - P(mine in j) = (sum_i x_i(j)) / N <== this is mathematically proven! - play j such that P(mine in j) is minimal - if several such j, choose one “closest to the frontier” (proposed by Legendre et al.) - if still several such j, randomly break ties.
  • 36. CSP - is very fast - but it is not optimal: here, CSP plays randomly! Also for the initial move: don't play the first move randomly! (sometimes an opening book)
  • 37. 1. Rules of MineSweeper 2. State of the art 3. The CSP approach 4. The UCT approach 5. The best of both worlds
  • 38. The MCTS approach ● Random simulations (Brügmann, 93) ● A tree of possible futures, growing along the simulations (several simultaneous papers, 2006)
  • 39. Why not UCT ? - looks like a stupid idea at first view - cannot compete with CSP in terms of speed - but at least UCT is consistent: given sufficient time, it will play optimally - tested in Couëtoux and Teytaud, 2011
  • 40. UCT (Upper Confidence Trees) Coulom (06) Chaslot, Saito & Bouzy (06) Kocsis Szepesvari (06)
  • 41. UCT
  • 45. UCT Kocsis & Szepesvari (06)
  • 47. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
  • 50. ... or exploration ? SCORE = 0/2 + k.sqrt( log(10)/2 )
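The scores on slides 47–50 are instances of the UCB formula. A minimal sketch with the slides' numbers (5 wins out of 7 simulations versus 0 out of 2, with 10 simulations at the parent); the values of k below are arbitrary illustrations:

```python
import math

def ucb_score(wins, visits, parent_visits, k):
    """Exploitation plus exploration bonus, as on the slides:
    SCORE = wins/visits + k * sqrt(log(parent_visits) / visits)."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

# The two arms from the slides.
score_exploit = ucb_score(5, 7, 10, k=0.2)  # well-tried, well-performing arm
score_explore = ucb_score(0, 2, 10, k=0.2)  # rarely tried arm
```

With a small k the exploitation term dominates and the 5/7 arm is selected; with a large k the exploration bonus dominates and the rarely tried 0/2 arm is selected instead.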
  • 51. UCT in one slide Great progress in the game of Go and in various other games
  • 52. UCT in one slide We use the CSP by Legendre et al 2012 for expansion and simulation.
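“UCT in one slide” can also fit in one code sketch: selection by UCB, expansion of one node per simulation, random playout, backpropagation. This is a generic illustration under assumptions (a toy one-step problem, uniform playouts), not the MineSweeper implementation with CSP-based expansion and simulation:

```python
import math
import random

class Node:
    """One node of the UCT tree: visit count, cumulated reward, children."""
    def __init__(self):
        self.visits = 0
        self.wins = 0.0
        self.children = {}  # action -> Node

def ucb(child, parent, k):
    return child.wins / child.visits + k * math.sqrt(
        math.log(parent.visits) / child.visits)

def uct(root_state, step, actions, n_sims=200, k=1.0):
    """`step(state, action)` returns (next_state, reward, done)."""
    root = Node()
    for _ in range(n_sims):
        node, state, path = root, root_state, [root]
        done, reward = False, 0.0
        while not done:
            acts = actions(state)
            untried = [a for a in acts if a not in node.children]
            if untried:                       # expansion: add one new node
                a = random.choice(untried)
                node.children[a] = Node()
            else:                             # selection: UCB-best child
                a = max(acts, key=lambda a: ucb(node.children[a], node, k))
            node = node.children[a]
            path.append(node)
            state, reward, done = step(state, a)
            if untried:
                break                         # stop descending after expansion
        while not done:                       # uniform-random playout
            state, reward, done = step(state, random.choice(actions(state)))
        for n in path:                        # backpropagation
            n.visits += 1
            n.wins += reward
    return max(root.children, key=lambda a: root.children[a].visits)

# Toy one-step problem: from state 4, action +1 reaches the target 5
# (reward 1) while action +2 overshoots (reward 0).
def step(s, a):
    s2 = s + a
    return s2, (1.0 if s2 == 5 else 0.0), s2 >= 5

best = uct(4, step, lambda s: [1, 2])
```

The recommended move is the most-visited child of the root, a common robust choice.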
  • 53. Applying UCT here ? • Might look like “too much” • But in many cases CSP is suboptimal • We have seen an example of a suboptimal move by CSP a few slides ago • Let's see two additional examples
  • 54. An example showing that the initial move matters (UCT finds it, not CSP). 3x3, 7 mines: the optimal move is anything but the center. Optimal winning rate: 25%. Optimal winning rate with a uniformly random initial move: 17/72. (yes, we get a 1/72 improvement!)
  • 55. Second such example: 15 mines on 5x5 board with GnoMine rule (i.e. initial move is a 0, i.e. no mine in the neighborhood) Optimal success rate = 100%!!!!! Play the center, and you win (well, you have to work...) The myopic CSP approach does not find it.
  • 56. 1. Rules of MineSweeper 2. State of the art 3. The CSP approach 4. The UCT approach 5. The best of both worlds
  • 57. Summary I have two approaches: ● CSP: ● Fast ● Suboptimal (myopic, only 1-step ahead) ● UCT: ● needs a generative model (probability of next states, given my action), ● Asymptotically optimal
  • 58. The best of both worlds ? ● CSP: ● Fast ● Suboptimal (myopic, only 1-step ahead) ● UCT: ● needs a generative model by CSP, ● Asymptotically optimal
  • 59. What do I need for implementing UCT ? A complete generative model. Given a state and an action, I must be able to simulate possible transitions. State S, Action a: (S,a) ==> S' Example: given the state below, and the action “top left”, what are the possible next states ?
  • 64. We published a version of UCT for MineSweeper in which this generative model was implemented using the rejection method only.
  • 65. Rejection algorithm: 1- randomly draw the mines; 2- if the draw is consistent with all observations, return the new observation; 3- otherwise, go back to 1.
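The three steps of the rejection algorithm can be sketched directly. The `consistent` predicate (does a candidate mine layout match every observation so far?) is a stand-in for the real observation check, and the tiny board below is hypothetical:

```python
import random

def sample_hidden_state(n_cells, n_mines, consistent, max_tries=100000):
    """Rejection method from the slide: 1) randomly draw the mines,
    2) if the draw is consistent with every observation, return it,
    3) otherwise go back to 1. Correct, but potentially very slow
    when few random draws satisfy the constraints."""
    cells = list(range(n_cells))
    for _ in range(max_tries):
        mines = set(random.sample(cells, n_mines))
        if consistent(mines):
            return mines
    raise RuntimeError("no consistent draw found")

# Tiny example: 4 covered cells, 2 mines, and an observed "1" whose
# neighborhood is cells {0, 1}: exactly one of these two is a mine.
sample = sample_hidden_state(4, 2, lambda m: len(m & {0, 1}) == 1)
```

Here 4 of the 6 possible draws are consistent, so rejection is cheap; on a real board with many observations the acceptance rate can collapse, which is exactly why the slides move on to CSP-based sampling.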
  • 66. It is mathematically ok, but it is too slow. Then, we used a weak CSP implementation. Still too slow. Now a reasonably fast implementation, with the Legendre et al heuristic.
  • 67. EXPERIMENTAL RESULTS Our results: 10 000 UCT-simulations per move; huge computation time per move (total = a few days).
  • 68. CONCLUSIONS: a methodology for sequential decision making - When you have a myopic solver (i.e. one which neglects long term effects, as too often in industry!) ==> improve it with heuristics (as Legendre et al) ==> combine it with UCT (as we did) ==> significant improvements - We have similar experiments on industrial testbeds
  • 69. Main ideas ● We like simulation-based optimizers ● For analyzing simulations ● For bilevel optimization (“anytime” criterion: smooth performance improvement) ● All we want as required knowledge is a simulator: ● No dependency on additional knowledge ● No simplified model ● But we want to be able to plug expertise in terms of strategy (e.g. handcrafted approximate policy) Tools: ● Upper Confidence Tree = Adaptive Simulator, good for combinatorial aspects ● Direct Policy Search = Adaptive Simulator, good for long term effects
  • 70. What is Direct Policy Search ? ● I have a parametric policy decision = p(w,s) – Inputs: parameter w, state s – Output: decision p(w,s) – E.g. ● p(w,s) = w.s (scalar product) ● p(w,s) = W0 + W1 x tanh(W2 x s + W3) (neural network) ● I have a simulator cost = simulate(p, transition): – Inputs = policy p, transition function – Output = cost (possibly noisy) – Principle: state = initial state; while (state not final) { decision = p(state); state = transition(state, decision) } ● Direct Policy Search( transition, policy p(.,.) ): – w = argmin simulate( p(w,.) ) // with your favorite optimization algorithm – return p(w,.) (A big part of our work.)
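The pseudocode on the slide can be made concrete in a few lines. This is a minimal sketch, not the power-grid system: a linear policy p(w,s) = w.s on a hypothetical one-dimensional stock problem, with plain random search standing in for “your favorite optimization algorithm”:

```python
import random

def simulate(policy, transition, initial_state, horizon=20):
    """Roll the policy out on the simulator; return the total cost."""
    state, cost = initial_state, 0.0
    for _ in range(horizon):
        decision = policy(state)
        state, step_cost = transition(state, decision)
        cost += step_cost
    return cost

def direct_policy_search(transition, initial_state, n_iters=500):
    """Direct Policy Search as on the slide: w = argmin simulate(p(w, .)),
    here with plain random search as the inner optimizer."""
    best_w, best_cost = None, float("inf")
    for _ in range(n_iters):
        w = random.uniform(-2.0, 2.0)
        cost = simulate(lambda s, w=w: w * s, transition, initial_state)
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w

# Hypothetical 1D stock: the decision w*s releases stock, the cost
# penalizes the next state; since s_next = (0.9 - w) * s, the total
# cost is minimized near w = 0.9.
def transition(s, d):
    s_next = 0.9 * s - d
    return s_next, s_next ** 2

w_star = direct_policy_search(transition, initial_state=1.0)
```

Random search makes the bilevel structure obvious; in practice any black-box optimizer (evolution strategies, etc.) can replace the inner loop without touching the simulator.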
  • 71. Example of policy p(w,state) = the decision such that ● Alpha = W0 + W1 tanh(W2.state) + W3 ● Week-Ahead-Reward(decision) + Alpha.Stock(decision) is maximum ==> with a linear transition, this is compliant with huge action dimensions ==> non-linearities are handled by the neural net
  • 72. One or two rivers (7 stocks)
  • 73. Network of rivers (7 rivers)
  • 75. Summary Two simulation-based tools for Sequential Decision Making: ● UCT = a nice tool for short term combinatorial effects ● DPS = a stable tool for long term effects Both: ● are anytime ● provide simulation results ● can take into account non-linear effects ● can handle a high-dimensional state space
  • 76. Thanks for your attention! 9 Mines. What is the optimal move ?