The slideshow contains a brief explanation of Gaussian process regression and Bayesian optimization. For two optimization problems, benchmarks against other local gradient-based and global heuristic optimization methods are included. They show that Bayesian optimization can identify better designs in exceptionally short computation times.
A machine learning method for efficient design optimization in nano-optics
1. A machine learning method for efficient design optimization in nano-optics
2. Computational challenges in nano-optics
The optical behavior of small structures (e.g. scattering into a certain direction) is dominated by diffraction, interference and resonance phenomena.
• A full solution of Maxwell's equations is required.
• The behavior is only known implicitly (black-box function).
• Computing the solution is time consuming (expensive black-box function).
3. Analysis of expensive black-box functions
Typical questions:
• Regression: What is the response f(x) for unknown parameter values x?
• Optimization: What are the best parameter values that lead to a measured/desired response?
• Integration: What is the average response?
[Figure: a black-box function maps input parameters (k, ω, p1, p2, …) to a system response whose computation requires the solution of Maxwell's equations. Application examples: isolated scatterers, metamaterials, geometry reconstruction.]
4. Regression models
• Regression models are important tools to interpolate between known data points.
• Further, they can be used for model-based optimization and numerical integration (quadrature).
5. Regression models (small selection)
In order of increasing predictive power and computational demands:
• K-nearest neighbors
• Linear regression
• Support vector machine
• Random forest trees
• Gaussian process regression (Kriging)
• (Deep) neural networks
Gaussian process regression:
+ Accurate and data efficient
+ Reliable (provides uncertainties)
+ Interpretable results
− Computationally demanding, but not as much as training neural networks
[C. E. Rasmussen, "Gaussian Processes in Machine Learning". Advanced Lectures on Machine Learning, Springer (2004)]
[B. Shahriari et al., "Taking the Human Out of the Loop: A Review of Bayesian Optimization". Proc. IEEE 104(1), 148 (2016)]
9. Gaussian process regression
In the following we do not need correlated random vectors of function values, but just the probability distribution of a single function value $y$ at some $x_* \in \mathcal{X}$.
This is simply a normal distribution $y \sim \mathcal{N}(\bar{y}, \sigma^2)$ with mean and variance
$$\bar{y} = m(x_*) + \sum_{i,j} k(x_*, x_i)\,(K^{-1})_{ij}\,[f(x_j) - m(x_j)],$$
$$\sigma^2 = k(x_*, x_*) - \sum_{i,j} k(x_*, x_i)\,(K^{-1})_{ij}\,k(x_j, x_*),$$
where $K_{ij} = k(x_i, x_j)$ is the covariance matrix of the training points.
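A minimal numpy sketch of these posterior formulas, assuming a zero prior mean m(x) = 0 and a squared-exponential kernel (the slides do not fix a particular kernel); the helper names are illustrative and not part of any JCMsuite API:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') between two point sets."""
    d = x1[:, None, :] - x2[None, :, :]
    return variance * np.exp(-0.5 * np.sum(d**2, axis=-1) / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-10):
    """Posterior mean and variance at X_test for a zero-mean GP."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)              # k(x_*, x_i)
    K_inv = np.linalg.inv(K)
    mean = K_star @ K_inv @ y_train                   # prior mean m(x) = 0
    var = rbf_kernel(X_test, X_test).diagonal() - np.einsum(
        "ij,jk,ik->i", K_star, K_inv, K_star)
    return mean, var

# Example: condition on three observations and predict at x_* = 1.5
X = np.array([[0.0], [1.0], [2.5]])
y = np.array([1.0, 0.2, -0.5])
print(gp_posterior(X, y, np.array([[1.5]])))
```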
13. Bayesian optimization
Problem: Find parameters $x \in \mathcal{X}$ that minimize $f(x)$. For the currently known smallest function value $y_{\min}$ we define the improvement
$$I(y) = \begin{cases} 0 & : \; y \ge y_{\min} \\ y_{\min} - y & : \; y < y_{\min} \end{cases}$$
We sample at the point of largest expected improvement $\alpha_{\mathrm{EI}}(x) = \mathbb{E}[I(y)]$, an analytic function derived from the normal distribution of $y$.
As more and more data points are collected in a local minimum, $\alpha_{\mathrm{EI}}(x) \to 0$ there. Hence, we do not get trapped in local minima, but eventually jump out of them.
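The expected improvement has a closed form that follows directly from the Gaussian posterior above; a short sketch (the helper name and argument convention are illustrative):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, y_min):
    """E[I(y)] for y ~ N(mean, std^2) and current best value y_min (minimization)."""
    std = np.maximum(std, 1e-12)              # guard against zero predictive std
    z = (y_min - mean) / std
    return (y_min - mean) * norm.cdf(z) + std * norm.pdf(z)

# Example: posterior prediction with mean 0.4 and std 0.3, current best 0.5
print(expected_improvement(np.array([0.4]), np.array([0.3]), 0.5))
```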
21. Utilizing derivatives
The JCMsuite FEM solver can also compute derivatives w.r.t. geometric parameters, material parameters and others. We can use derivatives to train the GP because differentiation is a linear operator:
• What is the mean function of the GP for derivative observations?
$$m_D(x) \equiv \mathbb{E}[\nabla f(x)] = \nabla \mathbb{E}[f(x)] = \nabla m(x) = 0$$
• What is the kernel function between an observation at $x$ and a derivative observation at $x'$?
$$k_D(x, x') \equiv \mathrm{cov}\big(f(x), \nabla f(x')\big) = \mathbb{E}\big[(f(x) - m(x))(\nabla f(x') - m_D(x'))\big] = \nabla_{x'} k(x, x')$$
• Analogously, the kernel function between a derivative observation at $x$ and a derivative observation at $x'$ is given as
$$k_{DD}(x, x') \equiv \mathrm{cov}\big(\nabla f(x), \nabla f(x')\big) = \nabla_x \nabla_{x'} k(x, x')$$
⇒ We can build a large GP (i.e. a large mean vector and covariance matrix) containing observations of the objective function and its derivatives.
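For a concrete kernel, these derivative blocks are available in closed form. Below is a sketch for a one-dimensional squared-exponential kernel (an assumed kernel choice, not necessarily the one used in JCMsuite); `joint_covariance` assembles the enlarged covariance matrix mentioned above:

```python
import numpy as np

def k(x, xp, ell=1.0, s2=1.0):
    """1D squared-exponential kernel k(x, x')."""
    return s2 * np.exp(-0.5 * (x - xp) ** 2 / ell**2)

def k_D(x, xp, ell=1.0, s2=1.0):
    """k_D(x, x') = d/dx' k(x, x'): covariance of f(x) with f'(x')."""
    return k(x, xp, ell, s2) * (x - xp) / ell**2

def k_DD(x, xp, ell=1.0, s2=1.0):
    """k_DD(x, x') = d^2/(dx dx') k(x, x'): covariance of f'(x) with f'(x')."""
    r = x - xp
    return k(x, xp, ell, s2) * (1.0 / ell**2 - r**2 / ell**4)

def joint_covariance(X, ell=1.0, s2=1.0):
    """Covariance matrix of the joint vector [f(x_1..n), f'(x_1..n)]."""
    xi, xj = np.meshgrid(X, X, indexing="ij")
    K, KD, KDD = k(xi, xj, ell, s2), k_D(xi, xj, ell, s2), k_DD(xi, xj, ell, s2)
    return np.block([[K, KD], [KD.T, KDD]])

print(joint_covariance(np.array([0.0, 0.5, 1.0])).shape)  # (6, 6)
```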
38. Making Bayesian optimization time efficient
Solving $\arg\max_x \alpha_{\mathrm{EI}}(x)$ can be very time consuming. Bayesian optimization runs inefficiently if the sample computation takes longer than the objective function calculation (simulation).
⇒ We use differential evolution to maximize $\alpha_{\mathrm{EI}}(x)$ and adapt the effort (i.e. the population size and number of generations) to the simulation time.
⇒ We calculate one sample in advance while the objective function is evaluated.
See Schneider et al., arXiv:1809.06674 (2019) for details.
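A minimal sketch of the first point using SciPy's differential evolution; the slides describe the strategy but not the implementation, so the function name and the default effort settings here are illustrative only:

```python
import numpy as np
from scipy.optimize import differential_evolution

def propose_next_sample(acquisition, bounds, popsize=15, maxiter=50):
    """Maximize alpha_EI(x) over box bounds; popsize and maxiter control
    the effort and could be adapted to the simulation time."""
    result = differential_evolution(
        lambda x: -acquisition(x),   # differential_evolution minimizes, so flip the sign
        bounds=bounds,
        popsize=popsize,
        maxiter=maxiter,
    )
    return result.x

# Toy acquisition function in place of the GP-based expected improvement
x_next = propose_next_sample(lambda x: -np.sum((x - 0.3) ** 2), bounds=[(-1, 1)] * 2)
print(x_next)
```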
40. Rastrigin function
• Defined on an $n$-dimensional domain as $f(\mathbf{x}) = A\,n + \sum_{i=1}^{n} \big[x_i^2 - A \cos(2\pi x_i)\big]$ with $A = 10$. We use $n = 3$ and $x_i \in [-2.5, 2.5]$.
• Sleeping for 10 s during each evaluation to make the function call "expensive".
• Parallel minimization with 5 parallel evaluations of $f(\mathbf{x})$.
Global minimum $f_{\min} = 0$ at $\mathbf{x} = 0$.
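A sketch of this benchmark objective; the artificial delay mimics an expensive simulation as described above (the parallel-evaluation machinery is not shown):

```python
import time
import numpy as np

A = 10.0
BOUNDS = [(-2.5, 2.5)] * 3   # search domain used in the benchmark

def rastrigin(x):
    """3D Rastrigin function, made artificially 'expensive' by sleeping 10 s."""
    x = np.asarray(x, dtype=float)
    time.sleep(10)
    return A * x.size + float(np.sum(x**2 - A * np.cos(2 * np.pi * x)))

# Global minimum: rastrigin(np.zeros(3)) == 0.0
```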
41. Choice of optimization algorithms
We compare the performance of Bayesian optimization (BO) with:
• Local optimization: gradient-based low-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS-B), started in parallel from 10 different locations
• Global heuristic optimization: differential evolution (DE), particle swarm optimization (PSO), covariance matrix adaptation evolution strategy (CMA-ES)
All optimization methods are run with standard parameters (see the sketch below).
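A minimal sketch of how two of these baselines could be run with SciPy defaults (the slides do not specify the implementation; PSO and CMA-ES would require additional packages, and the 10-second sleep is omitted for brevity):

```python
import numpy as np
from scipy.optimize import minimize, differential_evolution

def rastrigin(x):
    """Rastrigin objective from slide 40 (A = 10, n = 3)."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + float(np.sum(x**2 - 10.0 * np.cos(2 * np.pi * x)))

bounds = [(-2.5, 2.5)] * 3
rng = np.random.default_rng(0)

# Multi-start L-BFGS-B from 10 random starting points
starts = rng.uniform(-2.5, 2.5, size=(10, 3))
local = [minimize(rastrigin, x0, method="L-BFGS-B", bounds=bounds) for x0 in starts]
best_local = min(local, key=lambda r: r.fun)

# Differential evolution with standard parameters
de = differential_evolution(rastrigin, bounds, seed=0)
print(best_local.fun, de.fun)
```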
42. Benchmark on Rastrigin function
[Laptop with 2-core Intel Core i7 @ 2.7 GHz]
• BO converges significantly faster than the other methods.
• Although more elaborate, BO has no significant computation time overhead (total overhead approx. 3 min).
43. Benchmark on Rastrigin function with derivatives
[Laptop with 2-core Intel Core i7 @ 2.7 GHz]
• Derivative information speeds up the minimization.
• BO with and without derivatives finds lower function values than multi-start L-BFGS-B with derivatives.
44. Benchmark against open-source BO (scikit-optimize)
[Laptop with 2-core Intel Core i7 @ 2.7 GHz]
Comparison against the Bayesian optimization of scikit-optimize (https://scikit-optimize.github.io/stable/) shows that the implemented sample computation methods lead to better samples in drastically reduced computation time.
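For reference, a compact sketch of how the scikit-optimize baseline could be run on the Rastrigin problem; the evaluation budget `n_calls=60` is an assumption, not taken from the slides:

```python
import numpy as np
from skopt import gp_minimize

# Rastrigin objective from slide 40 (A = 10, n = 3), without the artificial sleep
f = lambda x: float(10 * len(x) + sum(xi**2 - 10 * np.cos(2 * np.pi * xi) for xi in x))

res = gp_minimize(f, dimensions=[(-2.5, 2.5)] * 3, n_calls=60, random_state=0)
print(res.x, res.fun)
```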
45. More benchmarks…
More benchmarks for realistic photonic optimization problems can be found in the publication ACS Photonics 6, 2726 (2019), https://arxiv.org/abs/1809.06674:
• Single-photon source
• Metasurface
• Parameter reconstruction
46. Conclusion
• Bayesian optimization is a highly efficient method for shape optimization.
• It can incorporate derivative information if available.
• It can be used for very expensive simulations but also for fast/parallelized simulations (e.g. one simulation result every two seconds).
47. Acknowledgements
We are grateful to the following institutions for funding this research:
• European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 675745 (MSCA-ITN-EID NOLOSS)
• EMPIR programme co-financed by the Participating States and by the European Union's Horizon 2020 research and innovation programme under grant agreement number 17FUN01 (Be-COMe)
• Virtual Materials Design (VIRTMAT) project by the Helmholtz Association via the Helmholtz program Science and Technology of Nanosystems (STN)
• Central Innovation Programme for SMEs of the German Federal Ministry for Economic Affairs and Energy on the basis of a decision by the German Bundestag (ZF4450901)
48. Resources
• Description of the FEM software JCMsuite
• Getting started with JCMsuite
• Tutorial on optimization with JCMsuite using Matlab®/Python
• Free trial download of JCMsuite