Voronoi optimistic optimization (VOO) is extended to Monte Carlo tree search (MCTS) to solve problems with continuous and high-dimensional action spaces. VOO implicitly constructs Voronoi cells around sampled points to balance exploration and exploitation without selecting partition dimensions. This Voronoi optimistic optimization applied to trees (VOOT) method outperforms Bayesian optimization and hierarchical partitioning methods on benchmark functions. The paper also proves a regret bound for VOOT, showing the number of simulations needed to guarantee a given level of performance. VOOT is tested on a packing problem with continuous box placements, outperforming state-of-the-art reinforcement learning and continuous-action MCTS algorithms.
Monte Carlo Tree Search in continuous spaces using Voronoi optimistic optimization with regret bounds
1. Monte Carlo Tree Search in continuous spaces
using Voronoi optimistic optimization
with regret bounds
Beomjoon Kim
(with Kyungjae Lee, Sungbin Lim, Leslie Pack Kaelbling, Tomas Lozano-Perez)
7. Main question addressed in our paper
How can we extend MCTS to
high-dimensional continuous
action space problems?
8. First, consider the easier problem
• Assume planning horizon is 1
The problem is reduced to a budgeted-black-box-
function optimization (BBFO) problem
9. First, consider the easier problem
• Assume planning horizon is 1
• The problem is reduced to a budgeted-black-
box-function optimization (BBFO) problem
10. First, consider the easier problem
• Assume planning horizon is 1
• The problem is reduced to a budgeted-black-
box-function optimization (BBFO) problem
11. First, consider the easier problem
• Assume planning horizon is 1
• The problem is reduced to a budgeted-black-
box-function optimization (BBFO) problem
within n number of evaluations
15. A class of BBFO methods
Bayesian
Optimization
(BO)
EI (Mockus, 1974)
PI (Kushner, 1964)
GP-UCB(Srinivas et al., 2010)
ES (Hennig & Schuler, 2012)
PES (Hernandez-Lobato et al., 2014)
EST (Wang et al., 2016)
16. Drawback in high-dimensional search spaces
Bayesian
Optimization
(BO)
Hierarchical
Partitioning
(HP)
• EI (Mockus, 1974)
• PI (Kushner, 1964)
• GP-UCB( Auer, 2002; Srinivas et
al., 2010)
• ES (Hennig & Schuler, 2012)
• PES (Hernandez-Lobato et al.,
2014)
• EST (Wang et al., 2016)
Requires global
optimization of the
acquisition function
17. Another class of BBFO methods
Bayesian
Optimization
(BO)
Hierarchical
Partitioning
(HP)
• EI (Mockus, 1974)
• PI (Kushner, 1964)
• GP-UCB( Auer, 2002; Srinivas et
al., 2010)
• ES (Hennig & Schuler, 2012)
• PES (Hernandez-Lobato et al.,
2014)
• EST (Wang et al., 2016)
Requires global
optimization of the
acquisition function
Hierarchical
Partitioning
(HP)
DOO(Munos, 2011)
SOO(Valko et al., 2013)
HOO(Bubeck et al.,
2010)
26. Selecting the most promising cell
• Deterministic Optimistic Optimization [Munos 2011]
Compute b-values at each cell, where
27. Selecting the most promising cell
• Deterministic Optimistic Optimization [Munos 2011]
• Compute the b-value of each cell, where
28. Selecting the most promising cell
• Deterministic Optimistic Optimization [Munos 2011]
• Compute the b-value of each cell, where
29. Selecting the most promising cell
• Deterministic Optimistic Optimization [Munos 2011]
• Compute the b-value of each cell, where
Exploration bonus
Prefers larger cells
35. DOO: exploration vs. exploitation
Exploitation of current function knowledge
prefer cells that have high function values
36. DOO: exploration vs. exploitation
Exploration of the search
space
prefer cells with large diameters
Exploitation of current function knowledge
prefer cells that have high function values
41. BUT: it requires deciding which dimension to cut
In -dimensional search space, you
have number of choices
42. BUT: it requires deciding which dimension to cut
Finding the
optimal sequence of dimensions
to cut is not trivial
In -dimensional search space, you
have number of choices
43. DOO’s intuitive idea
• DOO idea:
• The larger the cell, the more the exploration
bonus
• The larger the value of the cell, the more the
exploitation bonus
44. Our main question
DOO idea:
The larger the cell, the more the exploration bonus
The larger the value of the cell, the more the
exploitation bonus
How can we use this idea
without cell constructions?
47. Our method: Implicitly construct Voronoi cells
Voronoi cell:
A set of points closer to its
generator than all the other
generators
48. Our method: Implicitly construct Voronoi cells
Example
generator
Voronoi cell:
A set of points closer to its
generator than all the other
generators
49. Our method: Implicitly construct Voronoi cells
Example
generator
Voronoi cell:
A set of points closer to its
generator than all the other
generators
50. Our method: Implicitly construct Voronoi cells
Example
generator
Voronoi cell:
A set of points closer to its
generator than all the other
generators
Evaluated points
represent generators
64. Use rejection sampling to do exploitation
Assume best point found
so far
Randomly Sample:
65. Use rejection sampling to do exploitation
Reject until the sampled
point is closest to the
best point
Assume best point found
so far
Randomly Sample:
78. 10D – Bayesian optimization methods
Bestvaluefoundsofar
Number of evaluations
REMBO
BaMSOO
GPUCB
79. 10D – Hierarchical partitioning methods
Bestvaluefoundsofar
Number of evaluations
SOO
DOO
80. 10D – Evolutionary algorithm
Bestvaluefoundsofar
Number of evaluations
CMA-ES
81. 10D – Voronoi optimistic optimization
Bestvaluefoundsofar
Number of evaluations
VOO
82. 20D – even more drastic difference
Bestvaluefoundsofar
Number of evaluations
VOO
83. From BBFO to MCTS
• So that was when planning horizon is 1
• What about planning horizon >= 1, for
deterministic environments?
84. Assign a VOO agent at each node
Solve for the optimal action with respect to the
estimated Q-function
But…we have access to , not the true
VOO applied to Trees (VOOT): high-level
idea
85. VOO applied to Trees (VOOT): high-level
idea
Assign a VOO agent at each node
Solve the action optimization problem using the
estimated Q-function
But…we have access to , not the true
95. Takeaways
Voronoi Optimistic Optimization applied to Trees
(VOOT) for deterministic environment
• Empirical contribution: improvement over the
state-of-the-art, especially in high-dimensional
action-spaces
• Theoretical contribution: regret-bound for
VOOT