Evolutionary computer vision
2019
Olivier Teytaud, Facebook AI Research
Contact: oteytaud@fb.com (Messenger: Olivier Teytaud; WhatsApp: +44 7540 143007)
Started working in AI last century. Currently working on games, AlphaZero-style learning, and derivative-free optimization. Has worked at Artelys, INRIA, Google, and Facebook. Has 4 beautiful kids.
Vision:
Camille Couprie, Facebook AI Research
Laurent Meunier, Facebook AI Research
Jeremy Rapin, Facebook AI Research
Baptiste Roziere, Facebook AI Research
Olivier Teytaud, Facebook AI Research
Games 2:
Tristan Cazenave, Univ. Dauphine
Yen-Chi Chen, National Taiwan Normal University
Guan-Wei Chen, National Dong Hwa University
Shi-Yu Chen, National Dong Hwa University
Xian-Dong Chiu, National Dong Hwa University
Julien Dehos, Univ. Littoral Cote d’Opale
Maria Elsa, National Dong Hwa University
Qucheng Gong, Facebook AI Research
Hengyuan Hu, Facebook AI Research
Vasil Khalidov, Facebook AI Research
Chen-Ling Li, National Dong Hwa University
Hsin-I Lin, National Dong Hwa University
Yu-Jin Lin, National Dong Hwa University
Games 1:
Xavier Martinet, Facebook AI Research
Vegard Mella, Facebook AI Research
Jeremy Rapin, Facebook AI Research
Baptiste Roziere, Facebook AI Research
Gabriel Synnaeve, Facebook AI Research
Fabien Teytaud, Univ. Littoral Cote d’Opale
Olivier Teytaud, Facebook AI Research
Shi-Cheng Ye, National Dong Hwa University
Yi-Jun Ye, National Dong Hwa University
Shi-Jim Yen, National Dong Hwa University
Sergey Zagoruyko, Facebook AI Research
Evolution for dummies
Application: adversarial attacks
Evolution for hyperparameters
Evolutionary GANs
Evolutionary Super Resolution
Vision for something else than vision: Polygames
THANK YOU!!!! To the people who helped me with visa issues :)
Outline
1. What is derivative-free optimization? It's optimization without derivatives.
2. Is it useful? Yes.
3. Methods: evolution, Bayesian optimization, genetic, sequential quadratic programming…
4. Examples: let us save the world.
What is derivative-free optimization?
It's optimization without derivatives.
(Numerical) optimization is about finding the (argument) minimum of f.
Maybe you have learnt Newton, BFGS, etc.? These algorithms need the gradient of f.
Derivative-free optimization finds an approximate argmin of f without knowing gradient(f), just with a black box x → f(x).
It's finding x* such that for almost all x, f(x) ≥ f(x*).
Is it useful? Yes.
- Oven: temperature and time → quality measurement (quiche quality or ceramic quality).
- Aerodynamics simulator: shape parameters → quality measurement (energy & noise saving).
- Agents simulator: regulation parameters → quality measurement.
- Wind farm simulator: positions of wind turbines → quality measurement (power capacity).
Methods
Evolution, Bayesian optimization, genetic, sequential quadratic programming…
Let us discuss evolution strategies!
Evolution strategies
(1+1)-ES:
  x(0) = (0, 0)
  σ(0) = 1
  for n in {0, 1, 2, 3, …}:
    x' = x(n) + σ(n) × Gaussian
    if x' better than x(n):
      x(n+1) = x'
    else:
      x(n+1) = x(n)
Problem: close to the optimum, we might want to reduce σ.
(1+1)-ES with one-fifth rule:
  x(0) = (0, 0)
  σ(0) = 1
  for n in {0, 1, 2, 3, …}:
    x' = x(n) + σ(n) × Gaussian
    if x' better than x(n):
      x(n+1) = x'
      σ(n+1) = 2 σ(n)
    else:
      x(n+1) = x(n)
      σ(n+1) = 0.84 σ(n)
σ very big → success rate goes to… what?
σ very small → success rate goes to… what? (but slow progress)
Equilibrium when P(success) = what? (Hint: 0.84^4 ≈ 1/2.)
Answers:
σ very big → success rate goes to 0.
σ very small → success rate goes to 1/2, but slow progress.
Equilibrium when P(success) = 1/5: since 0.84^4 ≈ 1/2, one success (σ doubled) compensates four failures (σ halved overall), so σ is stable at one success per five trials.
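The one-fifth rule above can be sketched in a few lines of Python (an illustrative sketch, not code from the slides; the function and parameter names are mine):

```python
import random

def one_plus_one_es(f, x0, sigma=1.0, iters=2000, seed=0):
    """(1+1)-ES with the one-fifth success rule: double sigma on success,
    multiply it by 0.84 on failure (0.84^4 is roughly 1/2)."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    for _ in range(iters):
        child = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fc = f(child)
        if fc < fx:          # success: keep the child and widen the search
            x, fx = child, fc
            sigma *= 2.0
        else:                # failure: shrink the step size
            sigma *= 0.84
    return x, fx

# Minimize a sphere function centered at (3, -2), starting from the origin.
sphere = lambda p: (p[0] - 3.0) ** 2 + (p[1] + 2.0) ** 2
best, val = one_plus_one_es(sphere, [0.0, 0.0])
```

With a few thousand evaluations this reaches the optimum to high precision on smooth problems, while never touching a gradient.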
Problem: we might want to be parallel! Evaluate λ individuals simultaneously?
(µ/µ, λ)-ES with self-adaptation:
  x(0) = (0, 0)
  σ(0) = 1
  for n in {0, 1, 2, 3, …}:
    for i in {1, 2, 3, …, λ}:
      σ(n,i) = σ(n) × exp(1D-Gaussian)
      x'(i) = x(n) + σ(n,i) × 2D-Gaussian
    select the µ best x'(i) and their σ(n,i)
    x(n+1) = average of those µ best x'(i)
    σ(n+1) = exp(average of their log σ(n,i))
Problem: isotropic mutations! We want to mutate some variables more than others, e.g. f(x) = 100 (x(1) − 7)^2 + x(2)^2.
(µ/µ, λ)-ES with anisotropic self-adaptation:
  x(0) = (0, 0)
  σ(0) = (1, 1)
  for n in {0, 1, 2, 3, …}:
    for i in {1, 2, 3, …, λ}:
      σ(n,i) = σ(n) ⊙ exp(2D-Gaussian)   (⊙ = pointwise product)
      x'(i) = x(n) + σ(n,i) ⊙ 2D-Gaussian
    select the µ best x'(i) and their σ(n,i)
    x(n+1) = average of those µ best x'(i)
    σ(n+1) = exp(average of their log σ(n,i))
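The anisotropic scheme can be sketched as follows (illustrative; the damping factor tau on the log-normal mutation is a common refinement not on the slide, and all names are mine):

```python
import math
import random

def anisotropic_es(f, dim=2, mu=5, lam=20, gens=500, seed=0):
    """(µ/µ, λ)-ES with anisotropic self-adaptation: each child mutates its
    own per-coordinate step sizes; the µ best are recombined by averaging
    (arithmetic for x, geometric for the step sizes)."""
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(2.0 * dim)      # damping of the log-normal mutation
    x, sigma = [0.0] * dim, [1.0] * dim
    best, best_val = None, float("inf")
    for _ in range(gens):
        children = []
        for _ in range(lam):
            s = [si * math.exp(tau * rng.gauss(0.0, 1.0)) for si in sigma]
            c = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, s)]
            children.append((f(c), c, s))
        children.sort(key=lambda t: t[0])
        if children[0][0] < best_val:
            best_val, best = children[0][0], children[0][1]
        elite = children[:mu]
        x = [sum(c[i] for _, c, _ in elite) / mu for i in range(dim)]
        sigma = [math.exp(sum(math.log(s[i]) for _, _, s in elite) / mu)
                 for i in range(dim)]
    return best, best_val

# Ill-conditioned objective from the slide: f(x) = 100 (x1 - 7)^2 + x2^2.
f = lambda p: 100.0 * (p[0] - 7.0) ** 2 + p[1] ** 2
best, val = anisotropic_es(f)
```

The per-coordinate step sizes let σ(1) shrink faster than σ(2), matching the ill-conditioning of the objective.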
Derivative-free methods
1. Random search: randomly draw 1,000,000 points, and pick the best.
2. Estimation of Distribution Algorithm: randomly draw 1000 points, select the 250 best, fit a Gaussian to those 250 points, draw the next 1000 points from it; repeat until the budget is elapsed.
3. Particle swarm optimization: define 50 particles in the domain, with random velocities. Particles are attracted by their own best visited point and by the best point of the entire population, and receive random noise.
4. Quasi-random search: similar to random search, but with better-spread points, using low-discrepancy sequences.
… and so many others! Discrete domains, mixed domains, multi-objective, full covariance adaptation, etc.
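As a concrete instance, method 2 (a truncation EDA with an axis-aligned Gaussian) fits in a dozen lines of Python (an illustrative sketch; the names and default parameters are mine):

```python
import random

def eda(f, dim=2, pop=1000, elite=250, gens=40, seed=0):
    """Minimal Estimation of Distribution Algorithm: fit the mean and
    per-coordinate std of the elite, then sample the next population."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [5.0] * dim
    best, best_val = None, float("inf")
    for _ in range(gens):
        points = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        points.sort(key=f)
        if f(points[0]) < best_val:
            best, best_val = points[0], f(points[0])
        top = points[:elite]
        mean = [sum(p[i] for p in top) / elite for i in range(dim)]
        std = [max(1e-12, (sum((p[i] - mean[i]) ** 2 for p in top) / elite) ** 0.5)
               for i in range(dim)]
    return best, best_val

sphere = lambda p: sum((pi - 1.0) ** 2 for pi in p)
best, val = eda(sphere)
```

The elite's shrinking variance concentrates the search; on multimodal problems this can converge prematurely, which is one reason so many variants exist.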
Nevergrad: super easy to use!
Benchmark suites: deceptive (super hard functions), parallel, oneshot, illcondi, realworld.
Methods covered: evolutionary programming, mathematical programming (Cobyla, SQP…), design of experiments, Bayesian optimization (EGO).
2200+ GitHub stars, and growing.
Application: adversarial attacks
Adversarial attacks: given a classifier, find a small distortion so that it fails.
(Illustration: Goodfellow et al, OpenAI.)
Black-box adversarial attacks: no gradient, no white-box info; you can just send an image and get probabilities of classes.
(Illustration: Goodfellow et al, OpenAI.)
Black-box adversarial attacks with tiling: the distortion is constant over each tile of a coarse grid, which reduces the dimension of the search space.
Black-box adversarial attacks with tiling and evolution: optimize the per-tile distortions with an evolution strategy.
State of the art! Use a good optimization library rather than designing a bad ad hoc variant of random search…
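A toy sketch of the tiling + evolution idea (the black box here is a fabricated linear classifier, and all names are illustrative, not from the paper):

```python
import math
import random

def black_box_attack(predict, image, tiles, budget=500, eps=0.05, seed=0):
    """Tiled black-box attack with a (1+1)-style evolution: one distortion
    value per tile, scored only through calls to predict()."""
    rng = random.Random(seed)
    def apply(d):
        out = list(image)
        for val, idxs in zip(d, tiles):
            for i in idxs:
                out[i] = min(1.0, max(0.0, out[i] + val))
        return out
    d = [0.0] * len(tiles)
    score = predict(apply(d))        # probability of the true class
    for _ in range(budget):
        cand = [min(eps, max(-eps, v + rng.gauss(0.0, eps / 5))) for v in d]
        s = predict(apply(cand))
        if s < score:                # lower true-class probability = better attack
            d, score = cand, s
    return apply(d), score

# Toy black box: a fixed linear "classifier" over a 4-pixel image, two 2-pixel tiles.
w = [3.0, -1.0, 2.0, 0.5]
predict = lambda img: 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, img))))
adv, p = black_box_attack(predict, [0.5] * 4, tiles=[[0, 1], [2, 3]])
```

Only queries to `predict` are used: no gradient, no access to the weights, exactly the black-box setting above.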
Evolution for hyperparameters (with Pauline Luc)
Optimize the hyperparameters of machine learning algorithms.
- Much better than random search for hyperparametrizing video prediction!
- Much better than random search for hyperparametrizing image generation!
- Population control cool for neuro-playing 007!
- Good because less overfitting: the search targets areas stable under variable-wise perturbation.
- Less parallel than random search, but still very parallel (e.g. just 4 batches!).
Evolutionary GANs (with Morgane Riviere)
Generative models
GAN: generative adversarial network.
- A loss for "the discriminator must be unable to distinguish fake from real" (impacting the generator).
- A loss for "the discriminator must be able to distinguish fake from real" (impacting the discriminator).
NeuroEvolution & facial composites
GANs provide generators: given z (e.g. Gaussian in dim 256), G(z) is a face (or a texture, or …): faces, FashionGen, textures.
https://github.com/facebookresearch/pytorch_GAN_zoo
How to find a cool z? E.g. a face ~ Mickey Mouse, or a dress suggesting a given flower.
Maybe z = argmin Dissimilarity(G(z), targetImage)? (Dissimilarity: L2, VGG features, …)
Maybe z = argmin Dissimilarity(G(z), targetImage)?
- Pro: simple; no need for human interaction; non-trivial if G was not trained on data ~ targetImage.
- Con: needs a target image; needs a dissimilarity, and similarities on images do not work that well.
But we want more than just a copy-paste!
Maybe z = argmin Dissimilarity(G(z), targetImage) + penalization(z)?
(Dissimilarity: L2, VGG features, …; penalization: Discrim(G(z)), (norm(z) − dim(z))^2, …)
First idea: use Adam. Or SGD, or Nesterov momentum.
Better: use LBFGS. There is no stochasticity here, so Adam or SGD are just slower than LBFGS.
Same pros and cons as above, and with a target image it is hard to beat "copy-paste".
Fun idea: try evolutionary methods. They don't need "Dissimilarity(G(z), targetImage) + penalization(z)"; they just need answers to "is G(z1) better than G(z2)?"
- No dissimilarity.
- No penalization.
==> It's all in the user's head.
LBFGS is excellent without any tuning!
But when the objective function is only a proxy of the real objective, evolution does better → we need robustness.
Facial composites: select the 5 best.
Evolution is great: we don't need numerical criteria, just comparisons!
(Figure: target, and 3 random reconstructions, in 3 minutes each.)
Also for creating clothes!!! (FashionGen)
HEVOL rendering of Triss Merigold (The Witcher) vs artist rendering of Triss Merigold (The Witcher).
Wait, what is the state of the art in facial composites?
Holistic evolutionary methods outperform standard "local decomposition" methods: Frowd et al 2004, 2010, 2013; Gibson et al 2009; Solomon et al 2009.
Compared to this:
→ We add GANs.
→ We compare many derivative-free optimization methods.
→ We point out that humans do much better than any similarity measure (in terms of performance for a limited number of intermediate forward passes).
Overall: select the 5 best (replacement OK) out of 28 and average them, then repeat, with random perturbation.
→ Easy for humans
→ Fast convergence
→ End-to-end for all kinds of data
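The select-average-perturb loop can be sketched as follows (the human is replaced by a stand-in scoring function, and all names are mine):

```python
import random

def composite_search(pick_score, dim=8, pop=28, keep=5, gens=30, sigma=0.5, seed=0):
    """Comparison-based loop from the slide: show 28 candidates, keep the 5
    preferred ones (ranked here by a stand-in for the human), average them,
    and repeat with a slowly shrinking random perturbation."""
    rng = random.Random(seed)
    center = [0.0] * dim
    for _ in range(gens):
        cands = [[c + sigma * rng.gauss(0.0, 1.0) for c in center] for _ in range(pop)]
        chosen = sorted(cands, key=pick_score)[:keep]   # the "user" picks 5 of 28
        center = [sum(c[i] for c in chosen) / keep for i in range(dim)]
        sigma *= 0.93                                   # anneal the perturbation
    return center

# Stand-in for the human judgment: prefer latents close to a hidden target.
target = [1.0] * 8
pick_score = lambda z: sum((zi - ti) ** 2 for zi, ti in zip(z, target))
z = composite_search(pick_score)
```

Note that the loop never sees a numerical fitness, only a ranking of candidates, which is exactly what a human in front of 28 faces can provide.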
What else than facial composites? A cool fashion generator :)
The instructions were, respectively, to produce "Sportswear", "Clothes for cold weather", "Light clothes", "Sophisticated". 61 images were generated in each case, i.e. 4 generations of 15 images plus the initial one.
Evolutionary Super Resolution (with our Madagascar friends & Konstanz colleagues)
Super resolution
When training with noise injection: high-resolution = ConditionalGAN(LR, noise).
At inference time, typically noise = 0.
Let us instead search for the noise maximizing:
- QualityEstimator(ConditionalGAN(LR, noise))   (e.g. Koncept512)
- Discriminator(ConditionalGAN(LR, noise))
- −L2(noise)   (regularization)
Vision for something else than vision: Polygames (with Vegard Mella, Qucheng Gong, Hengyuan Hu, Xavier Martinet, Vasil Khalidov)
Because evolution → robustness to distribution shift + parallel + gradient-free.
The best contributions to Nevergrad will be rewarded (conference grants) → join us! It can be a huge technical code contribution, or a great one-line idea :)
Open scalable generic Zero learning: Polygames @ FB (Apr. 2019).
AlphaGo and AlphaZero are great. But:
- they use quite a lot of self-play data;
- there are still games in which humans are stronger than computers (global criteria, multiple goals, non-square cells);
- not all games can be zero-learnt.
→ Scientific innovations needed.
→ An open-source platform needed.
UCT (Upper Confidence Trees): Coulom (06); Chaslot, Saito & Bouzy (06); Kocsis & Szepesvari (06).
UCT starts with simple Monte Carlo: random exploration of possible futures.
Then keep track of statistics, and modify the Monte Carlo sampling with those statistics!
AlphaZero ingredient #1: Monte Carlo: random exploration of possible futures.
AlphaZero ingredient #2: Monte Carlo Tree Search (a.k.a. adaptive Monte Carlo): better than simple MC, use the statistics!
Exploitation: a move with 5 wins in 7 simulations gets
  SCORE = 5/7 + k·sqrt(log(10)/7)
… or exploration: a move with 0 wins in 2 simulations gets
  SCORE = 0/2 + k·sqrt(log(10)/2)
(10 = total number of simulations at this node.)
Why the second term? → For exploring everything, eventually.
Why the first term? → For more simulations in good directions.
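The score above is plain UCB1; as a sketch (illustrative names, k is the exploration constant):

```python
import math

def ucb_score(wins, visits, parent_visits, k=1.4):
    """UCB score from the slides: exploitation (empirical win rate) plus an
    exploration bonus that is large for rarely-visited moves."""
    if visits == 0:
        return float("inf")   # unvisited moves get tried first
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

# The two moves from the slides, with 10 simulations at the parent node:
exploit = ucb_score(5, 7, 10)   # 5/7 + k*sqrt(log(10)/7)
explore = ucb_score(0, 2, 10)   # 0/2 + k*sqrt(log(10)/2)
```

Because the bonus grows like log(parent_visits), every move is eventually revisited, while the first term keeps most simulations in good directions.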
UCT in one slide
UCT for choosing a move in a board B:
  while (I have time left):
    do a simulation:
      start at board B
      at each time step, choose the action by UCB (or at random if no statistics yet!)
    update statistics with this simulation
  return the most simulated action.
AlphaZero ingredient #3: deep network
Overview in "Deep Learning", LeCun, Bengio & Hinton, 2015.
Both a critic network (evaluating the probability of winning in a given position) and a policy network (providing a probability distribution on actions).
(Images: clarifai.com/technology and "the data science blog". Convolutions give invariance by translation; deeper layers give high-level features.)
PUCT: a variant of MCTS with a neural prior.
SCORE(state, action) = 5/7 + NN(state, action) × sqrt(10/7)
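The change from UCB can be sketched directly (following the slide's formula; the published AlphaZero bonus is slightly different, c·prior·sqrt(parent)/(1+visits)):

```python
import math

def puct_score(wins, visits, parent_visits, prior):
    """PUCT-style score as written on the slide: the UCB logarithm is
    replaced by the neural prior NN(state, action) for the move."""
    return wins / visits + prior * math.sqrt(parent_visits / visits)

score = puct_score(5, 7, 10, prior=0.3)   # 5/7 + 0.3*sqrt(10/7)
```

Moves the network believes in get a larger exploration bonus, so the search is steered by the prior instead of exploring uniformly.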
AlphaZero in a nutshell: a fixed-point method!
MCTS(NN): an MCTS which uses a neural net NN for
- evaluating leaves (no random rollout);
- suggesting policies (biasing the MCTS).
NN ← MCTS:
- Each client plays games with MCTS(NN).
- The server receives batches "(state, action, reward at end of game)" and trains with two loss functions (+ weight decay):
  - learn "state → reward" (critic);
  - learn "state → probability distribution on actions" (actor), i.e. mimic the MCTS.
ALPHAZERO:
- randomly initialize NN;
- iteratively imitate: NN-actor → MCTS(NN), NN-critic → game results.
Loss = prediction of value + (p imitates π from the MCTS) + weight decay.
The fixed point: the neural network is used by the MCTS (tree search + neural net), and the neural network is in turn trained using the MCTS results.
Adding mutations in Zero learning?
Original Zero learning = convolutions + fully connected layers.
Outputs of the network:
(1) A tensor Pi = logits of actions. Typically X×Y×C, where X×Y = board size = same first two dims as the input (as in dense image classification!).
(2) A float V = probability of winning.
Our claims:
(1) Fully connected layers have drawbacks for Pi: they lose the local invariance. Traditional Zero is invariant by permutation of the representation of actions on the board, which makes as little sense as full connections on images. Use convolutions, as in dense image classification!
(2) Add global pooling (= a fixed-length representation based on statistics over the board, for each channel).
→ By (1)+(2), the network is board-size-independent.
Adding mutations in Zero learning?
(3) Residual networks → we can add
- layers (initialized close to 0),
- channels (initialized close to 0),
- kernel size (new entries close to 0),
which preserves the computed function → incremental Zero learning, towards architecture search in Zero learning.
HEX
According to Bonnet et al (https://www.lamsade.dauphine.fr/~bonnet/publi/connection-games.pdf): "Since its independent inventions in 1942 and 1948 by the poet and mathematician Piet Hein and the economist and mathematician John Nash, the game of hex has acquired a special spot in the heart of abstract game aficionados. Its purity and depth has lead Jack van Rijswijck to conclude his PhD thesis with the following hyperbole [1]: << Hex has a Platonic existence, independent of human thought. If ever we find an extraterrestrial civilization at all, they will know hex, without any doubt. >>"
HEX: simplest rules ever!
I play black; you play white. We put a stone in turn.
If I connect my sides, I win. If you connect your sides, you win.
Theorem: no draw.
Until 2019/10/31, no computer managed to beat the best humans!
HEX: Polygames vs Arek Kulczycki (winner of the last LG tournament, best ELO rank on the LittleGolem server).
A bunch of GPUs, several days. Operated & trained by Vegard, a.k.a. "un putain de hacker de ouf" ("one hell of a crazy hacker"). Thanks a lot!!!
Fantastic game with a super long final path!
Trained on 13x13, won on 19x19.
THE END!!!
… and we're coming to many other games :)
Havannah: big board, diversity of winning conditions, long games, hexagons…
Let's have a beer!
Contact: oteytaud@fb.com

Evolutionary deep learning: computer vision.

  • 1.
    Evolutionary computer vision. .2019 Contact: oteytaud@fb.com Olivier Teytaud, Facebook AI Research OlIvier Teytaud Messenger Olivier TEYTAUD Whatsapp +44 7540 143007, Started to work in AI last century. Currently working on games, alphazero style learning, derivative-free optimization. Has been working at ARTELYS, INRIA, GOOGLE, FB. Has 4 beautiful kids.
  • 2.
    Evolutionary computer vision. .2019 Contact: oteytaud@fb.com Vision: Camille Couprie, Facebook AI Research Laurent Meunier, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Olivier Teytaud, Facebook AI Research Games 2: Tristan Cazenave, Univ. Dauphine Yen-Chi Chen, National Taiwan Normal University Guan-Wei Chen, National Dong Hwa University Shi-Yu Chen, National Dong Hwa University Xian-Dong Chiu, National Dong Hwa University Julien Dehos, Univ. Littoral Cote d’Opale Maria Elsa, National Dong Hwa University Qucheng Gong, Facebook AI Research Hengyuan Hu, Facebook AI Research Vasil Khalidov, Facebook AI Research Chen-Ling Li, National Dong Hwa University Hsin-I Lin, National Dong Hwa University Yu-Jin Lin, National Dong Hwa University Games 1: Xavier Martinet, Facebook AI Research Vegard Mella, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Gabriel Synnaeve, Facebook AI Research Fabien Teytaud, Univ. Littoral Cote d’Opale Olivier Teytaud, Facebook AI Research Shi-Cheng Ye, National Dong Hwa University Yi-Jun Ye, National Dong Hwa University Shi-Jim Yen, National Dong Hwa University Sergey Zagoruyko, Facebook AI Research
  • 3.
    Evolutionary computer vision. .2019 Contact: oteytaud@fb.com Evolution for dummies Application: adversarial attacks Evolution for hyperparameters Evolutionary GANs Evolutionary Super Resolution Vision for something else than vision: Polygames THANK YOU !!!! To people who helped me for visa issues J
  • 4.
    8 Methods Evolution, Bayesian optimization, genetic, sequential quadratic programming… Example s Letus save the world. Is it useful ? Yes. What is derivative free optimization ? It’s optimization without derivatives. • 1 2 3 4 Outline 1
  • 5.
    9 What is derivativefree optimization ? (Numerical) optimization is about finding the (argument) minimum of f. It’s optimization without derivatives. Maybe you have learnt Newton, BFGS, etc ? These algorithms need the gradient of f. It’s finding (approx.) argmin f without knowing gradient(f), just with a black-box x à f(x). It’s finding f* such that for almost all x, f(x) >= f(f*). 1
  • 6.
    1 8 Methods Evolution, Bayesian optimization, genetic, sequential quadratic programming… Examples Let ussave the world. Is it useful ? Yes. What is derivative free optimization ? It’s optimization without derivatives. 2 3 4 Outline 1
  • 7.
    1 9Is it useful? Yes. 2 Oven Quality Quality measurement Temperature Time
  • 8.
    2 0Is it useful? Yes. 2 Oven Quality Quality measurement Temperature Time Aerodynamism simulator Quality Quality measurement Shape parameters
  • 9.
    2 1Is it useful? Yes. 2 Oven Quality Quality measurement Temperature Time Aerodynamism simulator Quality Quality measurement Shape parameters Agents simulator Quality Quality measurement Regulation parameters
  • 10.
    2 2Is it useful? Yes. 2 Oven Quality (quiche quality or ceramic quality) Quality measurement Temperature Time Aerodynamism simulator Quality (energy & noise saving) Quality measurement Shape parameters Agents simulator Quality Quality measurement Regulation parameters Simulator Quality (power capacity) Quality measurement Position of wind turbines
  • 11.
    2 3 Methods Evolution, Bayesian optimization, genetic, sequential quadratic programming… Examples Let ussave the world. Is it useful ? Yes. What is derivative free optimization ? It’s optimization without derivatives. 1 2 3 4 Outline
  • 12.
    2 4 Methods Evolution, Bayesian optimization,genetic, sequential quadratic programming… Let us discuss evolution strategies! 3
  • 13.
    2 5 Evolution strategies (1+1)-ES: x(0) =(0, 0) σ(0) = 1 for n in {0, 1, 2, 3, …} x’ = x(n) + σ(n) x Gaussian if x’ better than x(n): x(n+1) = x’ 3
  • 14.
    2 6 Problem: close tothe optimum, we might want to reduce σ. (1+1)-ES with one-fifth rule: x(0) = (0, 0) σ(0) = 1 for n in {0, 1, 2, 3, …} x’ = x(n) + σ(n) x Gaussian if x’ better than x(n): x(n+1) = x’ σ(n+1) = 2 σ(n) else: σ(n+1) = 0.84 σ(n) σ very big è success rate goes to what ? σ very small è success rate what ?, but slow progress. Equilibrium when P(success) = What ? because 0.84^4 == 1 / 2
  • 15.
    2 7 Problem: close tothe optimum, we might want to reduce σ. (1+1)-ES with one-fifth rule: x(0) = (0, 0) σ(0) = 1 for n in {0, 1, 2, 3, …} x’ = x(n) + σ(n) x Gaussian if x’ better than x(n): x(n+1) = x’ σ(n+1) = 2 σ(n) else: σ(n+1) = 0.84 σ(n) 3 σ very big è success rate goes to 0. σ very small è success rate ½, but slow progress. Equilibrium when P(success) = 1/5 because 0.84^4 == 1 / 2
  • 16.
    2 8 Problem: we mightwant to be parallel! Evaluate λ individuals simultaneously ? (µ/µ, λ)-ES with self-adaptation: x(0) = (0, 0) σ(0) = 1 for n in {0, 1, 2, 3, …} for i in {1, 2, 3, …, λ} σ(n,i) = σ(n) x exp(1D-Gaussian) x’(i) = x(n) + σ(n,i) x 2D-Gaussian pick up the µ best x’(i) and their σ(n,i) x(n+1) = average of those µ best σ(n+1) = exp( average of these log σ(n,i)) 3
  • 17.
    2 9 Problem: isotropic mutations!We want to mutate some variables more than others. E.g. f(x) = 100(x(1)-7)2 + x(2)2 (µ/µ, λ)-ES with anisotropic self-adaptation: x(0) = (0, 0) σ(0) = (1,1) for n in {0, 1, 2, 3, …} for i in {1, 2, 3, …, λ} σ(n,i) = σ(n) *pointwise-product* exp(2D-Gaussian) x’(i) = x(n) + σ(n,i) *pointwise-product* 2D-Gaussian pick up the µ best x’(i) and their σ(n,i) x(n+1) = average of those µ best σ(n+1) = exp( average of these log σ(n,i) ) 3
  • 18.
    3 6 4 Derivative-free methods 1.Random search:randomly draw 1 000 000 points, and pick up the best. 2.Estimation of Distribution Algorithm: while (budget not elapsed) randomly draw 1000 points, select the 250 best, define a Gaussian matching those 250 points, repeat until budget elapsed. 3.Particle swarm optimization: define 50 particles in the domain, with random velocities. Particles are attracted by their best visited point, and by the best point for the entire population, and receive random noise. 4. Quasi-Random Search: similar to random search, but try to have a better positioning of points, using low-discrepancy sequences. … and so many others! Discrete domains, mixed domains,multi-objective, with full covariance adaptation, etc
  • 19.
    3 7 Methods Evolution, Bayesian optimization, genetic, sequential quadratic programming… Examples Let ussave the world. Is it useful ? Yes. What is derivative free optimization ? It’s optimization without derivatives. . 1 2 3 4 Outline
  • 20.
    3 8 Nevergrad: super easyto use! 4 X=deceptive (super hard functions) X=parallel X=oneshot X=illcondi X=realworld Evolutionary programming Mathematical programming (cobyla, sqp…) Design of experiments Bayesian Optimization (EGO) 2200 + github stars, growing.
  • 21.
    Evolutionary computer vision. .2019 Contact: oteytaud@fb.com Camille Couprie, Facebook AI Research Laurent Meunier, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Olivier Teytaud, Facebook AI Research Evolution for dummies Application: adversarial attacks Evolution for hyperparameters Evolutionary GANs Evolutionary Super Resolution Vision for something else than vision: Polygames
  • 22.
    Adversarial attacks: givena classifier, find a small distortion so that it fails. . 2019 Goodfellow et al, OpenAI
  • 23.
    Adversarial attacks: givena classifier, find a small distortion so that it fails. . 2019 Goodfellow et al, OpenAI
  • 24.
    Black-box adversarial attacks (nogradient, no white box info: you can just send an image and get probabilities of classes) . 2019 Goodfellow et al, OpenAI
  • 25.
    Black-box adversarial attackswith tiling (no gradient, no white box info: you can just send an image and get probabilities of classes) . 2019 Contact: oteytaud@fb.com
  • 26.
    Black-box adversarial attackswith tiling and evolution (no gradient, no white box info: you can just send an image and get probabilities of classes) . 2019 Contact: oteytaud@fb.com State of the art ! Use a good library rather than designing a bad ad hoc variant of random search…
  • 27.
    Evolutionary computer vision. .2019 Contact: oteytaud@fb.com Camille Couprie, Facebook AI Research Laurent Meunier, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Olivier Teytaud, Facebook AI Research +Pauline Luc Evolution for dummies Application: adversarial attacks Evolution for hyperparameters Evolutionary GANs Evolutionary Super Resolution Vision for something else than vision: Polygames
  • 28.
    4 9 Optimize the hyperparameters ofmachine learning algorithms. 4 Much better than random search for hyperparametrizing video prediction! Much better than random search for hyperparametrizing image generation! Population control cool for neuro-playing 007!
  • 29.
    5 0 Optimize the hyperparameters ofmachine learning algorithms. 4 Much better than random search for hyperparametrizing video prediction! Much better than random search for hyperparametrizing image generation! Population control cool for neuro-playing 007! Good because less overfitting. Less parallel than random search but still very parallel (e.g. just 4 batches!) Target = areas stable by variable-wise perturbation.
  • 30.
    Evolutionary computer vision. .2019 Contact: oteytaud@fb.com Camille Couprie, Facebook AI Research Laurent Meunier, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Olivier Teytaud, Facebook AI Research +Morgane Riviere Evolution for dummies Application: adversarial attacks Evolution for hyperparameters Evolutionary GANs Evolutionary Super Resolution Vision for something else than vision: Polygames
  • 31.
    Generative models GAN: generativeadversarial network A loss for “the discriminator must be unable to distinguish fake from real” (impacting the generator) A loss for “the discriminator must be able to distinguish fake from real” (impacting the discriminator)
  • 32.
    NeuroEvolution & facialcomposites GANs provide generators: given z (e.g. Gaussian in dim 256), G(z) is a face (or a texture, or …): Face FashionGen Textures https://github.com/facebookresearch/pytorch_GAN_zoo How to find a cool z ? E.g. a face ~ Mickey Mouse, or a dress suggesting a given flower.
  • 33.
    Maybe z =argmin Dissimilarity(G(z), targetImage) L2, VGG, …
  • 34.
    Maybe z =argmin Dissimilarity(G(z), targetImage) • Pro: • Simple • No need for human interaction • Non trivial if G was not trained on data ~ targetImage • Con: • Needs a target image • Needs a dissimilariy, and similarities on images do not work that well But we want more than just a copy-paste! L2, VGG, … Discrim(G(z)), (norm(z)-dim(z))2 …
  • 35.
    Maybe z = argmin Dissimilarity(G(z), targetImage) + penalization(z)? First idea = use Adam. Or SGD, or Nesterov momentum. Better: use LBFGS. There is no stochasticity; Adam or SGD are just slower than LBFGS. (Dissimilarity: L2, VGG, …; penalization: Discrim(G(z)), (norm(z)−dim(z))², …)
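As a tiny illustration of this latent-space search, here is plain gradient descent on z against a stand-in linear "generator" and an L2 dissimilarity. The real setting uses a trained GAN and Adam/SGD/LBFGS; the matrix generator, the target, and the learning rate below are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(16, 4))                  # stand-in "generator": image = G @ z
target = G @ np.array([1.0, -1.0, 0.5, 2.0])  # an image that G can reproduce exactly

def dissimilarity(z):
    # L2 dissimilarity between the generated "image" and the target
    return float(np.sum((G @ z - target) ** 2))

# Plain gradient descent on z (the talk prefers LBFGS; this is the simplest variant).
z = np.zeros(4)
for _ in range(500):
    grad = 2 * G.T @ (G @ z - target)
    z -= 0.01 * grad
```

After the loop, G @ z is close to the target, illustrating why this works well when the target lies in the generator's range, and why it degenerates into copy-paste when it does.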
  • 36.
    Maybe z = argmin Dissimilarity(G(z), targetImage) + penalization(z)? • Pro: • Simple • No need for human interaction • Non-trivial if G was not trained on data ~ targetImage • Con: • Needs a target image: hard to beat “copy-paste” • Needs a dissimilarity, and similarities on images do not work that well. (Dissimilarity: L2, VGG, …; penalization: Discrim(G(z)), (norm(z)−dim(z))², …)
  • 37.
    Maybe z = argmin Dissimilarity(G(z), targetImage) + penalization(z)? First idea = use Adam. Or SGD, or Nesterov momentum. Better: use LBFGS. There is no stochasticity; Adam or SGD are just slower than LBFGS. Fun idea: try evolutionary methods. Because they don’t need “Dissimilarity(G(z), targetImage) + penalization(z)”, they just need answers to “is G(z1) better than G(z2)?” • No dissimilarity • No penalization ==> it’s all in the user’s head. (Dissimilarity: L2, VGG, …; penalization: Discrim(G(z)), (norm(z)−dim(z))², …)
  • 38.
  • 39.
    When the objective function is a proxy of the real objective function, evolution ==> better! ⇒ we need robustness.
  • 40.
  • 41.
  • 42.
    Facial composites: select the 5 best. Evolution is great – we don’t need numerical criteria, just comparisons! (Figure: target and 3 random reconstructions, 3 minutes each.)
  • 43.
    Also for creating clothes!!! (FashionGen)
  • 44.
    HEVOL rendering of Triss Merigold (The Witcher). Artist rendering of Triss Merigold (The Witcher).
  • 45.
    Wait, what is the state of the art in facial composites? Holistic evolutionary methods outperform standard “local decomposition” methods: Frowd et al. 2004, 2010, 2013; Gibson et al. 2009; Solomon et al. 2009. Compared to this: ⇒ We add GANs ⇒ We compare many derivative-free optimization methods ⇒ We point out that humans do much better than any similarity measure (in terms of performance for a limited number of intermediate forward passes). OVERALL: SELECT THE 5 BEST (REPLACEMENT OK) OUT OF 28 and average them (then repeat, with random perturbation) ⇒ Easy for humans ⇒ Fast convergence ⇒ End-to-end for all kinds of data
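The "select 5 of 28, average, perturb" loop above can be sketched directly. The human in the loop is simulated here by distance to a hidden target latent; the dimensions, the sigma schedule, and all names are illustrative assumptions, not the paper's exact settings.

```python
import random

def composite_search(hidden_target, pop=28, keep=5, gens=40, sigma=0.5,
                     rng=random.Random(0)):
    # Simulated user: prefers candidates closer to a hidden target latent.
    score = lambda z: -sum((a - b) ** 2 for a, b in zip(z, hidden_target))
    dim = len(hidden_target)
    center = [0.0] * dim
    for _ in range(gens):
        # sample a population around the current center with random perturbation
        population = [[c + sigma * rng.gauss(0, 1) for c in center]
                      for _ in range(pop)]
        best = sorted(population, key=score, reverse=True)[:keep]  # select the 5 best
        center = [sum(v) / keep for v in zip(*best)]               # average them
        sigma *= 0.9                                               # shrink perturbations
    return center

found = composite_search([1.0, -0.5, 0.3, 0.7])
```

Note that `score` is only used to rank candidates, never as a numerical gradient signal, which is exactly why a human can replace it.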
  • 46.
    What else than facial composites? A cool fashion generator :) The instruction was respectively to produce “Sportswear”, “Clothes for cold weather”, “Light clothes”, “Sophisticated”. 61 images were generated in each case, i.e. 4 generations of 15 images plus the initial one.
  • 47.
    Evolutionary computer vision. 2019. Contact: oteytaud@fb.com Camille Couprie, Facebook AI Research Laurent Meunier, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Olivier Teytaud, Facebook AI Research & our Madagascar friends & Konstance guys. Evolution for dummies Application: adversarial attacks Evolution for hyperparameters Evolutionary GANs Evolutionary Super Resolution Vision for something else than vision: Polygames
  • 48.
    Super resolution. When training with noise injection: high-resolution = ConditionalGAN(LR, noise). At inference time, typically noise = 0. Let us instead try the noise maximizing:
    • QualityEstimator(ConditionalGAN(LR, noise))
    • Discriminator(ConditionalGAN(LR, noise))
    • −L2(noise) (regularization)
    Quality estimator: Koncept512.
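This noise search can be sketched as a simple evolutionary hill-climber maximizing quality minus an L2 penalty. The quality function below is a placeholder peaking at an arbitrary noise vector; the talk uses Koncept512 and the GAN's discriminator, neither of which is called here.

```python
import random

def pick_noise(quality, dim=8, budget=200, lam=0.1, rng=random.Random(0)):
    # (1+1)-style hill-climbing over the injected noise vector.
    # score = quality(noise) - lam * ||noise||^2  (the L2 regularization term)
    best = [0.0] * dim                                # start from noise = 0
    best_score = quality(best)
    for _ in range(budget):
        cand = [b + rng.gauss(0, 0.3) for b in best]  # mutate the current noise
        s = quality(cand) - lam * sum(c * c for c in cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

# Placeholder quality estimator: peaks at noise = (0.5, ..., 0.5).
toy_quality = lambda n: -sum((x - 0.5) ** 2 for x in n)
noise, score = pick_noise(toy_quality)
```

The regularizer keeps the optimized noise close to the training-time distribution, so the conditional GAN is not asked to extrapolate.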
  • 49.
    Evolutionary computer vision. 2019. Contact: oteytaud@fb.com Camille Couprie, Facebook AI Research Laurent Meunier, Facebook AI Research Jeremy Rapin, Facebook AI Research Baptiste Roziere, Facebook AI Research Olivier Teytaud, Facebook AI Research + Vegard Mella + Qucheng Gong + Hengyuan Hu + Xavier Martinet + Vasil Khalidov. Evolution for dummies Application: adversarial attacks Evolution for hyperparameters Evolutionary GANs Evolutionary Super Resolution Vision for something else than vision: Polygames. Because evolution ⇒ robustness to distribution shift + parallel + gradient-free. The best contributions to Nevergrad will be rewarded (conference grants) ⇒ join us! Can be a huge technical code contribution, or a great one-line idea :)
  • 50.
    Open scalable generic Zero learning: Polygames @ FB. Apr. 2019. Contact: oteytaud@fb.com AlphaGo and AlphaZero are great. But: - they use quite a lot of self-play data - there are still games in which humans are stronger than computers (global criterion, multiple goals, non-squared locations) - not all games can be zero-learnt. ⇒ Scientific innovations needed ⇒ Open-source platform needed
  • 51.
    Coulom (06); Chaslot, Saito & Bouzy (06); Kocsis & Szepesvari (06). UCT (Upper Confidence Trees) starts with simple Monte Carlo…
  • 52.
    UCT: Monte Carlo… and keep track of statistics!
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
    Exploitation ... Monte Carlo, and build statistics… and modify MC with those statistics!
  • 58.
    Exploitation ... SCORE = 5/7 + k·sqrt(log(10)/7)
  • 59.
    Exploitation ... SCORE = 5/7 + k·sqrt(log(10)/7)
  • 60.
    AlphaZero ingredient #1: Monte Carlo: random exploration of possible futures.
  • 61.
    Better than simple MC: use the statistics! SCORE = 5/7 + k·sqrt(log(10)/7). AlphaZero ingredient #2: Monte Carlo Tree Search (a.k.a. adaptive Monte Carlo)
  • 62.
    ... or exploration? SCORE = 0/2 + k·sqrt(log(10)/2)
  • 63.
    ... or exploration? SCORE = 0/2 + k·sqrt(log(10)/2). Why the second term? Why the first term?
  • 64.
    ... or exploration? SCORE = 0/2 + k·sqrt(log(10)/2). Why the second term? ⇒ For exploring everything, eventually. Why the first term?
  • 65.
    ... or exploration? SCORE = 0/2 + k·sqrt(log(10)/2). Why the second term? ⇒ For exploring everything, eventually. Why the first term? ⇒ For more simulations in good directions.
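The two terms can be written down directly; the function name and the default exploration constant k below are just common choices, not values from the talk.

```python
import math

def ucb_score(wins, visits, parent_visits, k=1.4):
    exploitation = wins / visits                   # first term: empirical win rate
    exploration = k * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration              # second term favors rarely tried moves

# The slide's two arms, with 10 parent simulations in total:
well_tried = ucb_score(5, 7, 10)    # 5/7 + k*sqrt(log(10)/7)
under_tried = ucb_score(0, 2, 10)   # 0/2 + k*sqrt(log(10)/2)
```

With k = 0 only exploitation remains and the 5/7 arm always wins; a large enough k lets the under-tried 0/2 arm overtake it, which is the point of the second term.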
  • 66.
    UCT in one slide. UCT for choosing a move in a board B:
    While (I have time left) {
      Do a simulation {
        Start at board B
        At each time step, choose the action by UCB (or at random if no statistics!)
      }
      Update statistics with this simulation
    }
    Return the most simulated action.
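That pseudocode can be made concrete on a toy single-player game: choose a bit at each of 5 steps, reward = fraction of ones. This is a deliberately minimal sketch (no tree object, statistics in a flat dictionary), not Polygames' implementation.

```python
import math
import random

def uct(n_sims=2000, depth=5, k=1.4, rng=random.Random(0)):
    stats = {}  # (state, action) -> (visits, total_reward); state = moves so far
    root = ()
    for _ in range(n_sims):
        state, path = root, []
        while len(state) < depth:
            if all((state, a) in stats for a in (0, 1)):
                # choose the action by UCB
                n_parent = sum(stats[(state, a)][0] for a in (0, 1))
                action = max((0, 1), key=lambda a:
                             stats[(state, a)][1] / stats[(state, a)][0]
                             + k * math.sqrt(math.log(n_parent) / stats[(state, a)][0]))
            else:
                action = rng.choice((0, 1))   # random if no statistics
            path.append((state, action))
            state += (action,)
        reward = sum(state) / depth           # toy game: reward playing 1
        for sa in path:                       # update statistics with this simulation
            n, w = stats.get(sa, (0, 0.0))
            stats[sa] = (n + 1, w + reward)
    # return the most simulated action at the root
    return max((0, 1), key=lambda a: stats[(root, a)][0])
```

Early simulations are pure Monte Carlo (no statistics yet); later ones are biased toward good branches, which is exactly the "adaptive Monte Carlo" of ingredient #2.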
  • 67.
    AlphaZero ingredient #3: deep network. Overview in “Deep learning”, LeCun, Bengio, Hinton 2015. Both a critic network (evaluating the probability of winning in a given position) and a policy network (providing a probability distribution on actions). (Figure: invariance by translation → high-level features; images from clarifai.com/technology and “the data science blog”.)
  • 68.
    PUCT: a variant of MCTS with a neural prior. SCORE(state, action) = 5/7 + NN(state, action)·sqrt(10/7)
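The slide's score, with the network output as a prior weight on the exploration term and no logarithm (unlike UCB), can be written out as follows; the function name is ours, and the comment notes the published AlphaZero variant for contrast.

```python
import math

def puct_score(mean_value, prior, parent_visits, visits):
    # Slide's form: value + prior * sqrt(parent_visits / visits).
    # AlphaZero's published variant is value + c * prior * sqrt(parent) / (1 + visits).
    return mean_value + prior * math.sqrt(parent_visits / visits)

# The slide's numbers: 5 wins in 7 visits, 10 parent visits, prior = NN(state, action).
example = puct_score(5 / 7, 0.3, 10, 7)
```

A strong prior keeps the exploration bonus high even after several visits, so moves the network likes get simulated first.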
  • 69.
    AlphaZero in a nutshell: a fixed-point method! MCTS(NN): an MCTS which uses a neural net NN for • Evaluating leaves (no random rollout) • Suggesting policies (biasing the MCTS). NN ← MCTS: • Each client plays games with MCTS(NN) • Server: • receives batches “(states, actions, reward at end of games)” • Two loss functions (+ weight decay): • Learn “state → reward” (critic) • Learn “state → probability distribution on actions” (actor), i.e. mimic the MCTS. ALPHAZERO: • randomly initialize NN • iteratively imitate: the NN actor imitates MCTS(NN), the NN critic imitates game results. (Loss terms: prediction of value; p imitates 𝜋 from the MCTS; weight decay.)
  • 70.
    AlphaZero in a nutshell: a fixed-point method! Neural network → MCTS using the neural network (tree search + neural net) → neural network trained using the MCTS results. (Loss terms: prediction of value; p imitates 𝜋 from the MCTS; weight decay.)
  • 71.
    Adding mutations in Zero learning? Original Zero learning = convolutions + fully connected layers. Outputs of the network: (1) a tensor Pi = logits of actions, typically X×Y×C, where X×Y = board size = same first two dims as the inputs (as in dense image classification!); (2) a float V = probability of winning. Our claims: (1) Fully connected layers have drawbacks for Pi: they lose the local invariance. Traditional Zero = invariant by permutation of the representation of actions on the board. As much nonsense as full connections on images. Make it as in dense image classification! (2) Add global pooling (= fixed-length representation based on statistics over the board, for each channel). ⇒ By (1)+(2), the network is board-size-independent.
  • 72.
    Adding mutations in Zero learning? (3) Residual networks ⇒ adding - layers (initialized close to 0) - channels (initialized close to 0) - kernel size (new entries close to 0) ⇒ preserves the computed function → incremental Zero learning ⇒ towards architecture search in Zero learning.
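The function-preservation claim in (3) is easy to check numerically: a residual block whose new weights are zero adds exactly nothing, so the grown network computes the same function. This is a toy dense numpy network standing in for the convolutional one, not Polygames code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # a batch of 4 toy "board feature" vectors
W = rng.normal(size=(8, 8))      # existing trained layer

def net(x, extra=None):
    h = np.tanh(x @ W)
    if extra is not None:
        h = h + np.tanh(h @ extra)   # newly added residual block
    return h

# Growing the network with a zero-initialized block leaves the output unchanged,
# because tanh(h @ 0) = 0 and the skip connection passes h through untouched.
grown = net(x, extra=np.zeros((8, 8)))
```

Training can then move the new weights away from zero, mutating the architecture without throwing away what was already learned.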
  • 73.
    HEX. According to Bonnet et al. (https://www.lamsade.dauphine.fr/~bonnet/publi/connection-games.pdf): “Since its independent inventions in 1942 and 1948 by the poet and mathematician Piet Hein and the economist and mathematician John Nash, the game of hex has acquired a special spot in the heart of abstract game aficionados. Its purity and depth has led Jack van Rijswijck to conclude his PhD thesis with the following hyperbole [1]: ‘Hex has a Platonic existence, independent of human thought. If ever we find an extraterrestrial civilization at all, they will know hex, without any doubt.’”
  • 74.
    HEX. Simplest rules ever! I play black. You play white. We put a stone in turn. If I connect my sides, I win. If you connect your sides, you win. Theorem: no draw. Until 2019/10/31, no computer managed to beat the best humans!
  • 75.
    HEX: Polygames vs Arek Kulczycki (winner of the last LG tournament, best ELO rank on the LittleGolem server). A bunch of GPUs, several days. Operated & trained by Vegard, a.k.a. “un putain de hacker de ouf” (“one hell of a hacker”). Thanks a lot!!!
  • 76.
    HEX. Simplest rules ever! I play black. You play white. We put a stone in turn. If I connect my sides, I win. If you connect your sides, you win. Theorem: no draw. Until 2019/10/31, no computer managed to beat the best humans! (Max Pixel)
  • 77.
    HEX. Simplest rules ever! I play black. You play white. We put a stone in turn. If I connect my sides, I win. If you connect your sides, you win. Theorem: no draw. Until 2019/10/31, no computer managed to beat the best humans! (pngimg.com)
  • 78.
    HEX. Simplest rules ever! I play black. You play white. We put a stone in turn. If I connect my sides, I win. If you connect your sides, you win. Theorem: no draw. Until 2019/10/31, no computer managed to beat the best humans! Fantastic game with a super long final path! TRAINED on 13x13, WON on 19x19.
  • 79.
    THE END !!! …and we’re coming to many other games :) Havannah: big board, diversity of winning conditions, long games, hexagons… Let’s have a beer! Contact: oteytaud@fb.com