Simulation-based optimization: Upper Confidence Tree and Direct Policy Search
1. Simulations for combining heuristics and consistent algorithms
Applications to Minesweeper, the game of Go, and Power Grids
O. Buffet, A. Couëtoux, H. Doghmen, W. Lin, O. Teytaud, & many others
3. Beautiful spatially distributed problem
● Short term (~10s): dispatching
● Real-time control, humans in the loop
● Days, weeks: combinatorial optimization
● Years: hydroelectric stocks
● Stochasticity (provides the price of water for the week-ahead problem)
● 50 years: investments
● Expensive optimization of strategies (parallel)
● Uncertainties: multi-objective (many!)? Worst-case?
4. Goals / tools
● Optimizing investments
– for the next 50 years
– in Europe / North Africa
● Taking into account power plants / networks
● Multi-objective (scenarios), visualization
with:
● Collaboration with a company (data, models)
● 3 Ph.D. students full time
● A dedicated machine with 500/1000 cores
5. Main ideas
● We like simulation-based optimizers
● For analyzing simulations
● For bilevel optimization (“anytime” criterion: smooth
performance improvement)
● Required knowledge is a simulator:
● No dependency on additional knowledge
● Simplified model (linearized...) not necessary
● But we want to be able to plug expertise in
terms of strategy (e.g. handcrafted approximate
policy)
6. Main ideas
● We like simulation-based optimizers
● For analyzing simulations
● For bilevel optimization (“anytime” criterion: smooth performance improvement)
● All we want as required knowledge is a simulator:
● No dependency on additional knowledge
● No simplified model
● But we want to be able to plug in expertise in terms of strategy (e.g. a
handcrafted approximate policy)
Tools:
● Upper Confidence Tree = Adaptive Simulator, good for
combinatorial aspects
● Direct Policy Search = Adaptive Simulator, good for long term
effects
7. A great challenge: MineSweeper.
- looks easy
- in fact, not easy: many approaches are myopic (one-step-ahead).
- partially observable
8. 1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
9. RULES
At the beginning, all locations are covered (unknown).
21. What is the optimal move?
Remark: the question makes sense without knowing the history.
You don't need the history to play optimally.
==> (this fact is mathematically non-trivial!)
22. What is the optimal move?
This one is easy.
Both remaining locations win with probability 50%.
24. Probability of a mine?
- Top: ?
- Middle: ?
- Bottom: ?
27. Probability of a mine?
- Top: 33%
- Middle: 33%
- Bottom: 33%
29. Probability of a mine?
- Top: 33%
- Middle: 33%
- Bottom: 33%
==> so, are all moves equivalent?
==> NOOOOO!!!
30. Probability of a mine?
- Top: 33%
- Middle: 33%
- Bottom: 33%
Top or bottom: 66% chance of winning!
Middle: 33%!
31. The myopic (one-step-ahead) approach plays randomly.
The middle is a bad move!
Even with the same probability of a mine, some moves are better than others!
32. State of the art:
- solved in 4x4
- NP-complete
- Constraint Satisfaction Problem approach:
= find the location least likely to be a mine, and play there.
==> 80% success on “beginner” (9x9, 10 mines)
==> 45% success on “intermediate” (16x16, 40 mines)
==> 34% success on “expert” (30x40, 99 mines)
33. 1. Rules of MineSweeper
2. State of the art
3. The CSP approach
(and other old known methods)
4. The UCT approach
5. The best of both worlds
34. - Exact MDP: very expensive. 4x4 solved.
- Single Point Strategy (SPS): simple local solving
- CSP (constraint satisf. problem): the main approach.
- (unknown) state:
x(i) = 1 if there is a mine at location i
- each visible location is a constraint:
If location 15 is labelled 4, then the constraint is
x(04)+x(05)+x(06)+x(14)+x(16)+x(24)+x(25)+x(26) = 4.
- find all solutions x1, x2, x3, ..., xN
- P(mine at j) = (sum_i x_i(j)) / N <== this is mathematically proven!
- play j such that P(mine in j) minimal
- if several such j, randomly break ties.
MDP= Markov Decision Process
CSP = Constraint Satisfaction Problem
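The CSP probability estimate above can be illustrated with a brute-force sketch (a hypothetical minimal instance, not the paper's implementation). Exhaustive enumeration is exponential in the number of cells, so it only works on tiny boards; real solvers rely on constraint propagation instead:

```python
from itertools import product

def mine_probabilities(n_cells, constraints):
    """Brute-force CSP sketch: enumerate every assignment x in {0,1}^n
    (x[i] = 1 iff a mine at cell i), keep those satisfying all label
    constraints, and estimate P(mine at j) as the fraction of
    solutions with x[j] = 1."""
    solutions = [x for x in product((0, 1), repeat=n_cells)
                 if all(sum(x[i] for i in cells) == count
                        for cells, count in constraints)]
    n = len(solutions)
    return [sum(x[j] for x in solutions) / n for j in range(n_cells)]

# Hypothetical 3-cell instance with one label: "exactly 1 mine among
# cells 0, 1, 2" -> three solutions, each cell mined in one of them.
probs = mine_probabilities(3, [((0, 1, 2), 1)])
```

On this toy instance every cell is mined in exactly one of the three consistent layouts, matching the 33%-everywhere situation of the earlier slides.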
35. CSP as modified by Legendre et al, 2012:
- (unknown) state:
x(i) = 1 if there is a mine at location i
- each visible location is a constraint:
If location 15 is labelled 4, then the constraint is
x(04)+x(05)+x(06)+x(14)+x(16)+x(24)+x(25)+x(26) = 4.
- find all solutions x1, x2, x3, ..., xN
- P(mine at j) = (sum_i x_i(j)) / N <== this is mathematically proven!
- play j such that P(mine in j) minimal
- if several such j, choose one “closest to the frontier”
(proposed by Legendre et al)
- if several such j, randomly break ties.
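The “closest to the frontier” tie-break can be sketched as follows; a minimal illustration where `revealed` and `neighbors` are hypothetical names for the game's own data, not Legendre et al's code:

```python
def break_ties_near_frontier(candidates, revealed, neighbors):
    """Sketch of the Legendre et al tie-break: among equally safe
    cells, prefer those adjacent to an already revealed cell (the
    frontier), where playing yields the most informative labels."""
    on_frontier = [j for j in candidates
                   if any(n in revealed for n in neighbors(j))]
    return on_frontier if on_frontier else candidates

# Toy 1-D board: cell 2 is revealed, so only candidate 1 touches
# the frontier; remaining ties would still be broken randomly.
picked = break_ties_near_frontier([0, 1, 4], revealed={2},
                                  neighbors=lambda j: (j - 1, j + 1))
```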
36. CSP
- is very fast
- but it's not optimal
- because of positions like this one, where CSP plays randomly!
Also for the initial move: don't play the first move randomly! (sometimes an opening book is used)
37. 1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
38. The MCTS approach
● Random simulations (Brügmann, 1993)
● A tree of possible futures, growing with the simulations (several simultaneous papers, 2006)
39. Why not UCT?
- looks like a stupid idea at first sight
- cannot compete with CSP in terms of speed
- but at least UCT is consistent: given sufficient time, it plays optimally.
- tested in Couëtoux and Teytaud, 2011
51. UCT in one slide
Great progress in the game of Go and in various other games
52. UCT in one slide
We use the CSP by Legendre et al, 2012, for expansion and simulation.
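As a rough companion to “UCT in one slide”, here is a minimal single-player UCT sketch (generic function names are assumptions; a real MineSweeper player would plug in the CSP-based generative model discussed later instead of a uniform rollout):

```python
import math
import random

class Node:
    """One tree node: a state plus visit statistics."""
    def __init__(self, state):
        self.state = state
        self.children = {}          # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def uct(root_state, actions, transition, is_final, reward,
        n_sims=2000, c=math.sqrt(2)):
    """Minimal single-player UCT: UCB1 selection, one expansion per
    simulation, uniform random rollout, backpropagation."""
    root = Node(root_state)
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. Selection: descend while the node is fully expanded.
        while (not is_final(node.state)
               and len(node.children) == len(actions(node.state))):
            parent = node
            node = max(parent.children.values(),
                       key=lambda ch: ch.total_reward / ch.visits
                       + c * math.sqrt(math.log(parent.visits) / ch.visits))
            path.append(node)
        # 2. Expansion: add one untried child.
        if not is_final(node.state):
            a = random.choice([a for a in actions(node.state)
                               if a not in node.children])
            node.children[a] = Node(transition(node.state, a))
            node = node.children[a]
            path.append(node)
        # 3. Simulation: random rollout to a final state.
        s = node.state
        while not is_final(s):
            s = transition(s, random.choice(actions(s)))
        # 4. Backpropagation of the rollout reward.
        r = reward(s)
        for n in path:
            n.visits += 1
            n.total_reward += r
    # Recommend the most simulated action at the root.
    return max(root.children, key=lambda a: root.children[a].visits)

# Toy game (an assumption, not MineSweeper): choose 3 bits, reward 1
# only for (1, 1, 1); UCT should recommend bit 1 as the first move.
random.seed(0)
best = uct(root_state=(),
           actions=lambda s: (0, 1),
           transition=lambda s, a: s + (a,),
           is_final=lambda s: len(s) == 3,
           reward=lambda s: 1.0 if s == (1, 1, 1) else 0.0)
```

Note the consistency property from the earlier slide: as `n_sims` grows, the most-visited root action converges to the optimal one.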
53. Applying UCT here?
• Might look like “too much”
• But in many cases CSP is suboptimal
• We have seen an example of a suboptimal move by CSP a few slides ago
• Let's see two additional examples
54. An example showing that the initial move matters (UCT finds it, not CSP).
3x3, 7 mines: the optimal move is anything but the center.
Optimal winning rate: 25%.
Optimal winning rate with a random uniform initial move: 17/72.
(yes, we get a 1/72 improvement!)
55. Second such example:
15 mines on a 5x5 board with the GnoMine rule
(i.e. the initial move is a 0, i.e. no mine in its neighborhood).
Optimal success rate = 100%!
Play the center, and you win (well, you have to work...).
The myopic CSP approach does not find it.
56. 1. Rules of MineSweeper
2. State of the art
3. The CSP approach
4. The UCT approach
5. The best of both worlds
57. Summary
I have two approaches:
● CSP:
● Fast
● Suboptimal (myopic, only 1-step ahead)
● UCT:
● needs a generative model (probability of next
states, given my action),
● Asymptotically optimal
58. The best of both worlds ?
● CSP:
● Fast
● Suboptimal (myopic, only 1-step ahead)
● UCT:
● needs a generative model by CSP,
● Asymptotically optimal
59. What do I need for implementing UCT ?
A complete generative model.
Given a state and an action,
I must be able to simulate possible transitions.
State S, Action a:
(S,a) ==> S'
Example: given the state below, and the action “top left”, what
are the possible next states ?
64. We published a version of UCT for MineSweeper in which the generative model was implemented using the rejection method only.
65. Rejection algorithm:
1- randomly draw the mines
2- if the draw is consistent with the observations, return the new observation
3- otherwise, go back to 1.
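A minimal sketch of this rejection method; names such as `covered` and `consistent` are placeholders, not the published implementation:

```python
import random

def rejection_sample(covered, n_mines, consistent, max_tries=100_000):
    """Rejection method from the slide: 1- randomly draw the mines
    over the covered cells, 2- if the layout is consistent with all
    observations, return it, 3- otherwise go back to 1. Correct but
    potentially very slow, which is exactly the problem noted on the
    next slide."""
    for _ in range(max_tries):
        layout = set(random.sample(covered, n_mines))
        if consistent(layout):
            return layout
    raise RuntimeError("no consistent layout found within max_tries")

# Toy check (hypothetical constraint): one mine among cells {0, 1, 2},
# with observations ruling out cell 0.
random.seed(0)
layout = rejection_sample(covered=[0, 1, 2], n_mines=1,
                          consistent=lambda m: 0 not in m)
```

When constraints are tight, almost every draw is rejected, which motivates replacing pure rejection with a CSP-based sampler.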
66. It is mathematically ok, but it is too slow.
Then, we used a weak CSP implementation. Still too slow.
Now: a reasonably fast implementation, with the Legendre et al heuristic.
67. EXPERIMENTAL RESULTS
10,000 UCT simulations per move ==> huge computation time (total = a few days).
(table: our results)
68. CONCLUSIONS: a methodology for sequential decision making
- When you have a myopic solver (i.e. one which neglects long-term effects, as too often in industry!)
==> improve it with heuristics (as Legendre et al)
==> combine it with UCT (as we did)
==> significant improvements
- We have similar experiments on industrial testbeds
69. Main ideas
● We like simulation-based optimizers
● For analyzing simulations
● For bilevel optimization (“anytime” criterion: smooth performance improvement)
● All we want as required knowledge is a simulator:
● No dependency on additional knowledge
● No simplified model
● But we want to be able to plug in expertise in terms of strategy (e.g. a
handcrafted approximate policy)
Tools:
● Upper Confidence Tree = Adaptive Simulator, good for
combinatorial aspects
● Direct Policy Search = Adaptive Simulator, good for long term
effects
70. What is Direct Policy Search ?
● I have a parametric policy: decision = p(w,s)
– Inputs: parameter w, state s
– Output: decision p(w,s)
– E.g.
● p(w,s) = w.s (scalar product)
● p(w,s) = W0 + W1 x tanh(W2 x s + W3) (neural network)
● I have a simulator: cost = simulate(p, transition)
– Inputs = policy p, transition function
– Output = cost (possibly noisy)
– Principle:
state = initial state
while (state not final):
● decision = p(state)
● state = transition(state, decision)
● Direct Policy Search( transition, policy p(.,.) ):
– w = argmin simulate( p(w,.) )
// with your favorite optimization algorithm (a big part of our work!)
– return p(w,.)
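The DPS loop above can be sketched compactly; here a simple (1+1)-style random search stands in for “your favorite optimization algorithm”, and the quadratic test simulator is purely illustrative (a real `simulate` would roll the policy through the transition function and return the accumulated cost):

```python
import random

def direct_policy_search(simulate, dim, n_iters=2000, sigma=0.3):
    """DPS skeleton: black-box search over the policy parameters w,
    scoring each candidate by one call to the simulator. A (1+1)-style
    random search plays the role of the outer optimizer (an evolution
    strategy would fit equally well)."""
    best_w = [0.0] * dim
    best_cost = simulate(best_w)
    for _ in range(n_iters):
        # Gaussian perturbation of the current best parameters.
        w = [x + random.gauss(0.0, sigma) for x in best_w]
        cost = simulate(w)
        if cost <= best_cost:            # keep improvements (a noisy
            best_w, best_cost = w, cost  # cost would need re-evaluation)
    return best_w

# Illustrative "simulator": a quadratic cost whose optimal policy
# parameters are (1, -2).
random.seed(0)
w = direct_policy_search(lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2,
                         dim=2)
```

Because each iteration yields a usable `best_w`, this matches the “anytime” criterion from the Main ideas slides: performance improves smoothly with the simulation budget.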
71. Example of policy
p(w,state) = decision such that
● Alpha = W0 + W1·tanh(W2·state) + W3
● Week-Ahead-Reward(decision) + Alpha·Stock(decision) is maximal
==> if the transition is linear, this remains tractable even for huge action dimensions
==> non-linearities are handled by the neural net
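This policy can be sketched as follows; `week_ahead_reward` and `stock` are placeholders for the problem's own functions, and the tiny net follows the slide's formula:

```python
import math

def policy(w, state, candidate_decisions, week_ahead_reward, stock):
    """Sketch of the slide's policy: a tiny neural net maps the state
    to a water value Alpha, then the decision maximizes short-term
    reward plus Alpha times the stock it leaves. When the transition
    is linear, this inner maximization is a linear program, hence
    compliant with huge action dimensions."""
    w0, w1, w2, w3 = w
    alpha = w0 + w1 * math.tanh(w2 * state) + w3
    return max(candidate_decisions,
               key=lambda d: week_ahead_reward(d) + alpha * stock(d))

# Hypothetical toy instance: with alpha = 2 the stock term dominates,
# so the decision maximizing -d + 2*d = d is chosen.
d = policy((0.0, 0.0, 0.0, 2.0), state=0.5,
           candidate_decisions=[0, 1, 2],
           week_ahead_reward=lambda d: -d,
           stock=lambda d: d)
```

DPS would then tune the four weights (W0..W3) by simulation, as in the previous slide.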
75. Summary
Two simulation-based tools for Sequential Decision
Making:
● UCT = a nice tool for short term combinatorial effect
● DPS = a stable tool for long term effects
Both:
● Are anytime
● Provide simulation results
● Can take into account non-linear effects
● Can handle a high-dimensional state space