What is derivative-free optimization?
(Numerical) optimization is about finding the (argument) minimum of f.
It is optimization without derivatives.
You may have learned Newton's method, BFGS, etc.; these algorithms need the gradient of f.
It is finding an (approximate) argmin of f without knowing gradient(f), just with a black box x → f(x).
It is finding x* such that, for almost all x, f(x) >= f(x*).
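For concreteness, a minimal Python sketch of the black-box setting (the quadratic f is just a placeholder objective, and pure random search stands in for any DFO method):

import random

def f(x):
    # The black box: we can evaluate it, but we get no gradient.
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

best_x, best_y = None, float("inf")
for _ in range(10000):          # random search: the simplest DFO method
    x = [random.uniform(-10, 10), random.uniform(-10, 10)]
    y = f(x)
    if y < best_y:
        best_x, best_y = x, y
print(best_x, best_y)           # approximate argmin and minimum of f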
Families of derivative-free methods:
• Evolutionary programming
• Mathematical programming (COBYLA, SQP, …)
• Design of experiments
• Bayesian optimization (EGO)
DFO for games?
• Partial observability: ok
• No gradient: ok
• Multi-agent: ok
• Collaborative or not: ok
• Two-level: ok (e.g. each agent has its own training, as in PBT)
[Diagram: derivative-free optimization of a multi-agent system. A transition function maps the state and the agents' actions to the next state; Agents 1, 2, 3 each receive their own observation (Observ. 1, 2, 3).]
Advantages of DFO
• Fog of war (partial information) taken into account
• Bounded rationality taken into account
• No gradient: ok
• Noise: ok
• Simple, transparent: a small piece of code contains all the information justifying the decisions
→ you know why power plant X is built (transparency++).
Examples:
• Power systems
• Roads
• Irrigation
• Games
Several agents
→ no problem: Ask/Tell/Recommend.
No gradient
→ no problem for DFO.
→ Compared to Bellman-style methods or model-predictive control: far fewer assumptions, and no parameter such as an operational/tactical/strategic horizon.
Serious game in power systems:
each power plant is maximizing its own income.
Which law do we need to make this approximately equivalent to global maximization? No details here, but a real problem.
Problem: close to the optimum, we might want to reduce σ.
(1+1)-ES with the one-fifth rule:
x(0) = (0, 0)
σ(0) = 1
for n in {0, 1, 2, 3, …}
    x' = x(n) + σ(n) × Gaussian
    if x' is better than x(n):
        x(n+1) = x'
        σ(n+1) = 2 σ(n)
    else:
        σ(n+1) = 0.84 σ(n)
σ very big → the success rate goes to 0.
σ very small → the success rate goes to 1/2, but progress is slow.
Equilibrium when P(success) = 1/5, because 0.84^4 ≈ 1/2.
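A runnable Python version of this (1+1)-ES (the sphere objective and the starting point are illustrative choices):

import random

def f(x):
    # Placeholder objective; any black box works here.
    return sum(xi ** 2 for xi in x)

x, sigma = [5.0, 5.0], 1.0
for n in range(1000):
    xp = [xi + sigma * random.gauss(0.0, 1.0) for xi in x]
    if f(xp) < f(x):      # success: accept the move and double the step-size
        x = xp
        sigma *= 2.0
    else:                 # failure: shrink; 0.84^4 ≈ 1/2 gives the one-fifth equilibrium
        sigma *= 0.84
print(x, sigma)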
Problem: we might want to be parallel! Evaluate λ individuals simultaneously?
(µ/µ, λ)-ES with self-adaptation:
x(0) = (0, 0)
σ(0) = 1
for n in {0, 1, 2, 3, …}
    for i in {1, 2, 3, …, λ}
        σ(n,i) = σ(n) × exp(1D-Gaussian)
        x'(i) = x(n) + σ(n,i) × 2D-Gaussian
    pick the µ best x'(i) and their σ(n,i)
    x(n+1) = average of those µ best
    σ(n+1) = exp(average of their log σ(n,i))
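A minimal Python sketch (λ, µ and the objective are illustrative; the λ evaluations in the inner loop are independent, hence trivially parallelizable):

import math
import random

def f(x):
    return sum(xi ** 2 for xi in x)   # placeholder objective

lam, mu = 20, 5
x, sigma = [5.0, 5.0], 1.0
for n in range(200):
    offspring = []
    for i in range(lam):
        s = sigma * math.exp(random.gauss(0.0, 1.0))        # self-adapt the step-size
        xi = [xj + s * random.gauss(0.0, 1.0) for xj in x]  # mutate with that step-size
        offspring.append((f(xi), xi, s))
    best = sorted(offspring, key=lambda t: t[0])[:mu]       # the mu best (lowest f)
    x = [sum(b[1][d] for b in best) / mu for d in range(len(x))]
    sigma = math.exp(sum(math.log(b[2]) for b in best) / mu)  # geometric mean of selected sigmas
print(x, sigma)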
Problem: isotropic mutations! We want to mutate some variables more than others. E.g. f(x) = 100 (x(1) − 7)² + x(2)².
(µ/µ, λ)-ES with anisotropic self-adaptation:
x(0) = (0, 0)
σ(0) = (1, 1)
for n in {0, 1, 2, 3, …}
    for i in {1, 2, 3, …, λ}
        σ(n,i) = σ(n) *pointwise-product* exp(2D-Gaussian)
        x'(i) = x(n) + σ(n,i) *pointwise-product* 2D-Gaussian
    pick the µ best x'(i) and their σ(n,i)
    x(n+1) = average of those µ best
    σ(n+1) = exp(average of their log σ(n,i))
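The change relative to the previous sketch is one step-size per coordinate; a Python version on the ill-conditioned objective from this slide:

import math
import random

def f(x):
    return 100.0 * (x[0] - 7.0) ** 2 + x[1] ** 2   # the example above

lam, mu, dim = 20, 5, 2
x, sigma = [0.0, 0.0], [1.0, 1.0]                  # one sigma per coordinate
for n in range(300):
    offspring = []
    for i in range(lam):
        s = [sigma[d] * math.exp(random.gauss(0.0, 1.0)) for d in range(dim)]  # pointwise
        xi = [x[d] + s[d] * random.gauss(0.0, 1.0) for d in range(dim)]
        offspring.append((f(xi), xi, s))
    best = sorted(offspring, key=lambda t: t[0])[:mu]
    x = [sum(b[1][d] for b in best) / mu for d in range(dim)]
    sigma = [math.exp(sum(math.log(b[2][d]) for b in best) / mu) for d in range(dim)]
print(x, sigma)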
Problem: what if there is noise? Like, my oven is not completely deterministic, and I want oven(temperature) to be excellent on average.
Problem: classical algorithms stagnate!
(µ/µ, λ)-ES with anisotropic self-adaptation and population control:
x(0) = (0, 0)
σ(0) = (1, 1)
for n in {0, 1, 2, 3, …}
    for i in {1, 2, 3, …, λ}
        σ(n,i) = σ(n) *pointwise-product* exp(2D-Gaussian)
        x'(i) = x(n) + σ(n,i) *pointwise-product* 2D-Gaussian
    pick the µ best x'(i) and their σ(n,i)
    x(n+1) = average of those µ best
    σ(n+1) = exp(average of their log σ(n,i))
    if the population average is significantly better than the
    population average 5 iterations earlier,
    then decrease λ; otherwise increase λ.
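A Python sketch of the population-control rule on top of the previous loop (the ±4 update of λ and the plain comparison are simplifications; a real implementation would use a statistical test for "significantly better"):

import math
import random

def f(x):
    return sum(xi ** 2 for xi in x) + random.gauss(0.0, 1.0)   # noisy black box

lam, dim = 20, 2
x, sigma = [5.0, 5.0], [1.0, 1.0]
history = []                                   # average fitness of each population
for n in range(200):
    mu = max(1, lam // 4)
    offspring = []
    for i in range(lam):
        s = [sigma[d] * math.exp(random.gauss(0.0, 1.0)) for d in range(dim)]
        xi = [x[d] + s[d] * random.gauss(0.0, 1.0) for d in range(dim)]
        offspring.append((f(xi), xi, s))
    best = sorted(offspring, key=lambda t: t[0])[:mu]
    x = [sum(b[1][d] for b in best) / mu for d in range(dim)]
    sigma = [math.exp(sum(math.log(b[2][d]) for b in best) / mu) for d in range(dim)]
    history.append(sum(o[0] for o in offspring) / lam)
    if len(history) > 5:
        if history[-1] < history[-6]:          # better than 5 iterations ago: progress
            lam = max(4, lam - 4)
        else:                                  # stagnation: average out the noise more
            lam += 4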
Problem: what if it's discrete? Like, we optimize in {0,1}^10.
Discrete (1+1)-ES:
x(0) = (0, 0, 0, …, 0)
for n in {0, 1, 2, 3, …}
    x' = mutation(x(n))
    if x' is better than x(n):
        x(n+1) = x'
Mutation:
• RLS: randomly mutate one of the variables.
• (1+1) classical: each variable is mutated with probability 1/10 (if nothing is mutated, discard and redraw).
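A Python version with both mutation operators (OneMax, the number of ones, is a placeholder objective to be maximized):

import random

def onemax(x):                 # placeholder: number of ones, to be maximized
    return sum(x)

def mutate_rls(x):             # RLS: flip exactly one uniformly chosen bit
    y = list(x)
    i = random.randrange(len(y))
    y[i] = 1 - y[i]
    return y

def mutate_classical(x):       # flip each bit with probability 1/n; redraw if no flip
    while True:
        y = [1 - b if random.random() < 1.0 / len(x) else b for b in x]
        if y != x:
            return y

x = [0] * 10
for n in range(1000):
    xp = mutate_classical(x)   # or mutate_rls(x)
    if onemax(xp) > onemax(x):
        x = xp
print(x)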
Problem: what if mutating a small number of variables does not work? E.g. f(x) has a bad value as long as fewer than half of the bits are equal to 1.
Same discrete (1+1)-ES as above, with a third mutation operator (sketch below):
• (1+1) mixing of mutation rates: randomly draw the mutation probability p in (0, 1), then mutate each variable with probability p.
Duc-Cuong Dang & Per Kristian Lehre: uniform mixing of mutation rates.
Carola Doerr & Benjamin Doerr: parameter control.
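The new operator drops into the loop above in place of mutate_classical:

import random

def mutate_mixed(x):
    # Draw the mutation rate itself at random, then mutate at that rate.
    p = random.random()        # p uniform in (0, 1)
    return [1 - b if random.random() < p else b for b in x]

A large p occasionally flips many bits at once, which is exactly what the plateau above requires.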
Convergence rates?
Self-adaptive algorithms, case f(x) = ||x||:
• (x(n)/σ(n)) is a homogeneous Markov chain
• It has a stationary distribution (non-trivial)
• E[log ||x(n+1)|| − log ||x(n)||] converges to a constant
• log ||x(n)|| / n converges to a constant
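Stated compactly (my notation; write the constant as −c, with c > 0 in the convergent regime shown on the plot):

\[
\mathbb{E}\bigl[\log\|x_{n+1}\| - \log\|x_n\|\bigr] \xrightarrow[n\to\infty]{} -c,
\qquad
\frac{\log\|x_n\|}{n} \xrightarrow[n\to\infty]{} -c,
\]

so \(\|x_n\| \approx e^{-cn}\): linear (i.e. geometric) convergence.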
[Plot: log ||x(n)|| versus n: a straight line with negative slope.]
Auger, TCS
Convergence rates in the parallel case? Does the slope increase?
• The slope is linear in λ as long as λ is smaller than the dimension
• Then it becomes logarithmic in λ
[Plot: slope −log ||x(n)|| / n as a function of λ: linear speedup up to λ ≈ dimension, then logarithmic growth.]
Beyer, book
Fournier et al, PPSN & Algorithmica.
Convergence rates in the noisy case?
Known for some algorithms (not the most usual ones…):
E[log ||x(n)||] / log(n) → −1/2 for some robust "repeat" algorithms;
E[log ||x(n)||] / log(n) → −1 for population control and some other methods, under some assumptions.
Astete-Morales et al., FOGA
Cauwet et al., FOGA
Derivative-free methods
1. Random search: randomly draw 1 000 000 points and pick the best.
2. Estimation of Distribution Algorithm: while the budget is not elapsed, randomly draw 1000 points, select the 250 best, and fit a Gaussian to those 250 points; repeat.
3. Particle swarm optimization: define 50 particles in the domain, with random velocities. Particles are attracted by their own best visited point and by the best point of the entire population, and receive random noise.
4. Quasi-random search: similar to random search, but tries to position the points better, using low-discrepancy sequences.
… and so many others! (A sketch of method 2 follows.)
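A Python sketch of the Estimation of Distribution Algorithm from item 2, with a diagonal (axis-aligned) Gaussian for brevity; a full covariance matrix would be closer to CMA, mentioned below:

import random
import statistics

def f(x):
    return sum(xi ** 2 for xi in x)            # placeholder objective

mean, std, dim = [5.0, 5.0], [5.0, 5.0], 2
for iteration in range(50):                    # budget = 50 x 1000 evaluations
    pts = [[random.gauss(mean[d], std[d]) for d in range(dim)] for _ in range(1000)]
    pts.sort(key=f)
    elite = pts[:250]                          # select the 250 best
    for d in range(dim):                       # refit the Gaussian on the elite
        vals = [p[d] for p in elite]
        mean[d] = statistics.mean(vals)
        std[d] = statistics.stdev(vals)
print(mean)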
Derivative-free methods (continued)
1. CMA: extending anisotropic mutations to rotated ellipsoids.
2. Differential evolution: also rotated anisotropic mutations, with a cheap computational cost.
3. Lamarckism, i.e. inheritance of acquired characteristics:
   a) Part of the state is evolution-optimized.
   b) The rest evolves during the evaluation (e.g. neural network training).
→ As if your child could inherit your post-workout muscles or your trained neural net.
→ Remember the inheritance of the immune system?
Let us save the world.
What is the API of our optimization algorithms?
• Dear optimizer, I am asking you which point I should evaluate → ASK.
• Dear optimizer, at point (3.1, 2.0), I tell you that the quality is 47 → TELL.
• Dear optimizer, I am tired now, please recommend a good point → RECOMMEND.
→ Asynchronous: ok.
→ Multi-agent: ok.
→ Noisy: ok.
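In Nevergrad (introduced below), this loop is literally ask/tell/recommend; note that the parametrization argument was named differently in early versions:

import nevergrad as ng

def f(x):
    return sum(xi ** 2 for xi in x)            # black-box objective

optimizer = ng.optimizers.OnePlusOne(parametrization=2, budget=100)
for _ in range(optimizer.budget):
    candidate = optimizer.ask()                # ASK: which point should I evaluate?
    loss = f(candidate.value)
    optimizer.tell(candidate, loss)            # TELL: the quality at that point
recommendation = optimizer.provide_recommendation()   # RECOMMEND: a good point
print(recommendation.value)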
Optimize the hyperparameters of machine learning algorithms, or the weights of a neural network.
• Much better than random search for hyperparametrizing video prediction!
• Much better than random search for hyperparametrizing image generation!
• Population control works well for neuro-playing 007!
Example (simplified): power systems.
• Time steps 1, 2, 3, …, 365 × 10 → 10 years.
• For i in {1, 2, 3, …, 12}:
  • one stock Si of water for hydroelectricity;
  • a random process Ri adding water to Si.
• A random process C, "electricity consumption", for each time step.
• A market converting the productions and C into a cost per MWh.
• An agent Ai converting water from Si into electricity Ei.
Agent Ai = a mapping from (X, C, S1, S2, S3, S4, …, S12) to a real number.
This mapping has parameters, for example a neural network.
Simulator(parameters of the agents):
- For each time step:
  - for each stock Si, update it with Ri, and Ai(params) converts some water to electricity.
- Quality for society = the ecological/economical penalty associated with the production (includes some agents).
- Quality for the independent agents = money earned.
→ one optimizer per independent agent, and one optimizer for the others + society (a toy sketch follows below).
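A hypothetical Python skeleton of such a simulator; the market, the inflows, and the linear "agents" are toy placeholders, not the real model:

import random

NUM_STOCKS, STEPS = 12, 365 * 10

def agent(params, inputs):
    # Agent Ai: maps (X, C, S1, ..., S12) to an amount of water converted
    # to electricity; a linear map stands in for the neural network.
    return max(0.0, sum(p * v for p, v in zip(params, inputs)))

def simulator(all_params):                  # one 14-dim parameter vector per agent
    stocks = [100.0] * NUM_STOCKS
    incomes = [0.0] * NUM_STOCKS            # quality for each independent agent
    penalty = 0.0                           # negated quality for society
    for t in range(STEPS):
        consumption = random.uniform(50.0, 150.0)        # random process C
        for i in range(NUM_STOCKS):
            stocks[i] += random.uniform(0.0, 2.0)        # random inflow Ri
        inputs = [float(t % 365), consumption] + stocks  # (X, C, S1, ..., S12)
        production = [min(stocks[i], agent(all_params[i], inputs))
                      for i in range(NUM_STOCKS)]
        price = consumption / (1.0 + sum(production))    # toy market: scarcity raises price
        for i in range(NUM_STOCKS):
            stocks[i] -= production[i]
            incomes[i] += price * production[i]
        penalty += abs(consumption - sum(production))    # toy ecological/economic penalty
    return incomes, penalty

Each independent agent's optimizer maximizes its own income; the societal optimizer minimizes the penalty.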
Nevergrad: super easy to use! Needs Python >= 3.6.
Installation: pip install nevergrad
Or (developer):
• "git clone" the repository
• Run "pip install -e .[all]"
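Once installed, a first run takes a few lines (in the spirit of the README; argument names can vary slightly across versions):

import nevergrad as ng

def square(x):
    return sum((xi - 0.5) ** 2 for xi in x)    # toy objective

optimizer = ng.optimizers.OnePlusOne(parametrization=2, budget=100)
recommendation = optimizer.minimize(square)    # ask/tell loop handled internally
print(recommendation.value)                    # should be close to (0.5, 0.5)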
No time for installing, for coding, for experimenting, or I don't want to take care of distributing my experiments on many machines. No problem!
Just hack the code directly in the GitHub interface:
- Duplicate the code of one of the optimizers in optimizerlib.py.
- Give it another name, and modify it using your knowledge.
- Add your optimizer's name to one of the experiments in experiments.py.
→ Create a pull request, and our servers will run your experiments a couple of days after it is merged; the results will be visible at https://dl.fbaipublicfiles.com/nevergrad/allxps/list.html
Example: TBPSA (a variant of population control) for games.
• No redundant representation → population control fails.
• Redundant representation → population control excellent!
Why?
Example: random seeds!
Consider a stochastic player:
action = f(random, situation).
Now:
- situation → hash → hash % 5000;
- parameter = a vector of 5000 seeds;
- seed 0 means the "time" seed (i.e. no fixed seed).
→ Looks weird, but works!
→ Optimizes the opening book in partially observable games, e.g. phantom Go.
→ World Computer-Bridge Championship in September 2016: Wbridge5.
Fast and easy experiments.
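A hypothetical Python sketch of such a seed-vector policy (the action choice is a placeholder, and situations must be hashable):

import random

NUM_SEEDS = 5000

def act(seeds, situation, legal_moves):
    idx = hash(situation) % NUM_SEEDS          # situation → hash → hash % 5000
    seed = seeds[idx]
    rng = random.Random() if seed == 0 else random.Random(seed)   # seed 0 = "time" seed
    return rng.choice(legal_moves)

# The DFO optimizer searches over the integer vector `seeds`; the black-box
# objective is the game result (e.g. the win rate of the resulting opening book).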
Example: the game of war; evolution and/or bandits.
(The strategy: the order in which you pick up the cards you earn!)