# A machine learning method for efficient design optimization in nano-optics

The slideshow contains a brief explanation of Gaussian process regression and Bayesian optimization. For two optimization problems, benchmarks against other local gradient-based and global heuristic optimization methods are included. They show that Bayesian optimization can identify better designs in exceptionally short computation times.

Published in: Data & Analytics

### A machine learning method for efficient design optimization in nano-optics

1. A machine learning method for efficient design optimization in nano-optics
2. Computational challenges in nano-optics. The optical behavior of small structures (e.g. scattering into a certain direction) is dominated by diffraction, interference and resonance phenomena. Consequently: a full solution of Maxwell's equations is required; the behavior is only known implicitly (black-box function); and computing the solution is time-consuming (expensive black-box function).
3. Analysis of expensive black-box functions. Typical questions: Regression: what is the response $f(x)$ for unknown parameter values $x$? Optimization: what are the best parameter values that lead to a measured/desired response? Integration: what is the average response? The system response requires a solution of Maxwell's equations for inputs such as $k$, $\omega$ and parameters $p_1, p_2, \dots$ Applications: isolated scatterers, metamaterials, geometry reconstruction.
4. Regression models. Regression models are important tools to interpolate between known data points. Further, they can be used for model-based optimization and numerical integration (quadrature).
5. Regression models (small selection), in order of increasing predictive power and computational demands: k-nearest neighbors, linear regression, support vector machines, random forest trees, Gaussian process regression (kriging), (deep) neural networks. Gaussian process regression: + accurate and data efficient; + reliable (provides uncertainties); + interpretable results; − computationally demanding, but not as much as training neural networks. [C. E. Rasmussen, "Gaussian processes in machine learning", Advanced Lectures on Machine Learning, Springer (2004)] [B. Shahriari et al., "Taking the Human Out of the Loop: A Review of Bayesian Optimization", Proc. IEEE 104(1), 148 (2016)]
6. Gaussian process regression: how does it work?
7. Gaussian process regression. A Gaussian process (GP) is a distribution over functions on a continuous domain $\mathcal{X} \subset \mathbb{R}^N$. It is defined by a mean function $\mu: \mathcal{X} \to \mathbb{R}$ and a covariance function (kernel) $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Training data: $M$ known function values $f(x_1), \dots, f(x_M)$ with corresponding covariance matrix $\mathbf{K} = [k(x_i, x_j)]_{ij}$. Random function values at positions $\mathbf{X}^* = (x_1^*, \dots, x_N^*)$ form a multivariate Gaussian random variable $\mathbf{Y}^* \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with probability density $p(\mathbf{Y}^*) = (2\pi)^{-N/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{Y}^* - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{Y}^* - \boldsymbol{\mu})\right)$, means $\boldsymbol{\mu}_i = \mu(x_i^*) + \sum_{kl} k(x_i^*, x_k)\, \mathbf{K}^{-1}_{kl}\, [f(x_l) - \mu(x_l)]$, and covariances $\boldsymbol{\Sigma}_{ij} = k(x_i^*, x_j^*) - \sum_{kl} k(x_i^*, x_k)\, \mathbf{K}^{-1}_{kl}\, k(x_l, x_j^*)$. For a proof see: http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
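To make the definition concrete, here is a minimal NumPy sketch that draws correlated function values $\mathbf{Y}^*$ from a zero-mean GP prior. The squared-exponential kernel and its length scale are illustrative assumptions (the slides later use the Matérn-5/2 kernel):

```python
import numpy as np

def sq_exp(xa, xb, sigma=1.0, length=0.5):
    """Squared-exponential kernel k(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2))."""
    d = xa[:, None] - xb[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 50)            # positions X* in the domain
K = sq_exp(xs, xs) + 1e-10 * np.eye(50)   # covariance matrix plus jitter
mu = np.zeros(50)                          # zero mean function
# Three correlated random function-value vectors Y* ~ N(mu, K)
samples = rng.multivariate_normal(mu, K, size=3)
```

Each row of `samples` is one random function evaluated on the grid; plotting the rows gives the typical "wiggly curves" picture of GP prior samples.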
8. Gaussian process regression
9. Gaussian process regression. In the following we do not need correlated random vectors of function values, but just the probability distribution of a single function value $y$ at some $x^* \in \mathcal{X}$. This is simply a normal distribution $y \sim \mathcal{N}(\bar{y}, \sigma^2)$ with mean and variance $\bar{y} = \mu(x^*) + \sum_{ij} k(x^*, x_i)\, \mathbf{K}^{-1}_{ij}\, [f(x_j) - \mu(x_j)]$ and $\sigma^2 = k(x^*, x^*) - \sum_{ij} k(x^*, x_i)\, \mathbf{K}^{-1}_{ij}\, k(x_j, x^*)$.
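A minimal sketch of these two formulas in NumPy, assuming a squared-exponential kernel (an illustrative choice; any valid kernel $k$ works the same way):

```python
import numpy as np

def sq_exp(xa, xb, sigma=1.0, length=1.0):
    """Squared-exponential kernel (illustrative choice of k)."""
    d = xa[:, None] - xb[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_star, X, f, mu=0.0):
    """Posterior of a single function value y at x*:

    y_bar   = mu(x*) + sum_ij k(x*, x_i) K^{-1}_ij [f(x_j) - mu(x_j)]
    sigma^2 = k(x*, x*) - sum_ij k(x*, x_i) K^{-1}_ij k(x_j, x*)
    """
    K = sq_exp(X, X) + 1e-10 * np.eye(len(X))      # jitter for stability
    k_star = sq_exp(np.array([x_star]), X)[0]      # vector of k(x*, x_i)
    y_bar = mu + k_star @ np.linalg.solve(K, f - mu)
    var = sq_exp(np.array([x_star]), np.array([x_star]))[0, 0] \
        - k_star @ np.linalg.solve(K, k_star)
    return y_bar, max(var, 0.0)

X = np.array([0.0, 1.0, 2.0])   # training positions x_i
f = np.sin(X)                   # known function values f(x_i)
y_bar, var = gp_predict(1.0, X, f)
# At a training point the posterior mean reproduces the data and the
# variance collapses towards zero.
```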
10. Gaussian process regression
11. GP hyperparameters. The mean and covariance function are usually parametrized as $\mu(x) = \mu_0$ and $k(x, x') = \sigma^2 C_{5/2}(r) = \sigma^2 \left(1 + \sqrt{5}\,r + \tfrac{5}{3} r^2\right) \exp(-\sqrt{5}\,r)$ (Matérn-5/2 function) with $r^2 = \sum_i (x_i - x_i')^2 / l_i^2$. The values of $\mu_0$, $\sigma$, $l_i$ are chosen to maximize the log-likelihood of the observations: $\log P(\mathbf{Y}) = -\tfrac{M}{2} \log 2\pi - \tfrac{1}{2} \log |\mathbf{K}| - \tfrac{1}{2} (\mathbf{Y} - \boldsymbol{\mu})^T \mathbf{K}^{-1} (\mathbf{Y} - \boldsymbol{\mu})$.
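The Matérn-5/2 kernel and the log-likelihood above can be sketched as follows; the hyperparameter fit itself (maximizing the log-likelihood over $\mu_0$, $\sigma$, $l_i$) is only indicated in a comment:

```python
import numpy as np

def matern52(Xa, Xb, sigma=1.0, lengths=1.0):
    """Matérn-5/2 kernel: sigma^2 (1 + sqrt(5) r + 5 r^2 / 3) exp(-sqrt(5) r).

    Xa, Xb have shape (M, dim); `lengths` holds the per-dimension scales l_i.
    """
    d = (Xa[:, None, :] - Xb[None, :, :]) / lengths
    r = np.sqrt((d**2).sum(axis=-1))
    return sigma**2 * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def log_likelihood(Y, X, mu0, sigma, lengths):
    """log P(Y) = -M/2 log(2 pi) - 1/2 log|K| - 1/2 (Y - mu)^T K^{-1} (Y - mu)."""
    M = len(Y)
    K = matern52(X, X, sigma, lengths) + 1e-10 * np.eye(M)  # jitter for stability
    resid = Y - mu0
    _, logdet = np.linalg.slogdet(K)
    return (-0.5 * M * np.log(2 * np.pi) - 0.5 * logdet
            - 0.5 * resid @ np.linalg.solve(K, resid))

# The hyperparameters mu0, sigma, lengths would be chosen to maximize this
# value, e.g. with a gradient-based optimizer on the negative log-likelihood.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.sin(X).ravel()
ll = log_likelihood(Y, X, mu0=0.0, sigma=1.0, lengths=1.0)
```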
12. Bayesian optimization. Use Gaussian process regression to run optimization or parameter reconstruction.
13. Bayesian optimization. Problem: find parameters $x \in \mathcal{X}$ that minimize $f(x)$. For the currently known smallest function value $y_{min}$ we define the improvement $I(y) = 0$ for $y \ge y_{min}$ and $I(y) = y_{min} - y$ for $y < y_{min}$. We sample at points of largest expected improvement $\alpha_{EI}(x) = \mathbb{E}[I(y)]$ (an analytic function derived from the normal distribution of $y$).
20. Bayesian optimization. As more and more data points accumulate in a local minimum, $\alpha_{EI}(x) \to 0$ there. Hence, we do not get trapped in local minima, but eventually jump out of them.
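Since $y$ is normally distributed, the expected improvement has the closed form $\alpha_{EI}(x) = (y_{min} - \bar{y})\,\Phi(z) + \sigma\,\varphi(z)$ with $z = (y_{min} - \bar{y})/\sigma$, where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. A minimal SciPy sketch:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, y_min):
    """Closed-form expected improvement for minimization:

    alpha_EI = (y_min - mean) * Phi(z) + std * phi(z),  z = (y_min - mean) / std
    """
    std = max(std, 1e-12)              # guard against zero predicted variance
    z = (y_min - mean) / std
    return (y_min - mean) * norm.cdf(z) + std * norm.pdf(z)

# A point predicted well below y_min has a much larger EI than a point
# predicted at y_min itself:
ei_low = expected_improvement(0.0, 0.1, y_min=1.0)
ei_at = expected_improvement(1.0, 0.1, y_min=1.0)
```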
21. Utilizing derivatives. The JCMsuite FEM solver can also compute derivatives w.r.t. geometric parameters, material parameters and others. We can use derivatives to train the GP because differentiation is a linear operator. What is the mean function of the GP for derivative observations? $\mu_D(x) \equiv \mathbb{E}[\nabla f(x)] = \nabla \mathbb{E}[f(x)] = \nabla \mu(x) = 0$ (for the constant mean $\mu(x) = \mu_0$). What is the kernel function between an observation at $x$ and a derivative observation at $x'$? $k_D(x, x') \equiv \mathrm{cov}(f(x), \nabla f(x')) = \mathbb{E}[(f(x) - \mu(x))(\nabla f(x') - \mu_D(x'))] = \nabla_{x'} k(x, x')$. Analogously, the kernel function between derivative observations at $x$ and $x'$ is $k_{DD}(x, x') \equiv \mathrm{cov}(\nabla f(x), \nabla f(x')) = \nabla_x \nabla_{x'} k(x, x')$. We can thus build a large GP (i.e. a large mean vector and covariance matrix) containing observations of the objective function and its derivatives.
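A one-dimensional sketch of these covariance blocks. The squared-exponential kernel is used here because its derivative kernels are compact to write down; this is an illustrative assumption, not necessarily the kernel used by the authors:

```python
import numpy as np

# 1-D squared-exponential kernel (l = 1, sigma = 1) and its derivative kernels:
#   k(x, x')   = exp(-(x - x')^2 / 2)
#   kD(x, x')  = d/dx' k = (x - x') k(x, x')          (value vs. derivative)
#   kDD(x, x') = d/dx d/dx' k = (1 - (x - x')^2) k    (derivative vs. derivative)
def k(x, xp):   return np.exp(-0.5 * (x - xp) ** 2)
def kD(x, xp):  return (x - xp) * k(x, xp)
def kDD(x, xp): return (1.0 - (x - xp) ** 2) * k(x, xp)

def joint_covariance(X):
    """Covariance matrix of the stacked vector [f(x_1..M), f'(x_1..M)]."""
    M = len(X)
    C = np.empty((2 * M, 2 * M))
    for i, xi in enumerate(X):
        for j, xj in enumerate(X):
            C[i, j] = k(xi, xj)              # f  vs f
            C[i, M + j] = kD(xi, xj)         # f  vs f'
            C[M + i, j] = kD(xj, xi)         # f' vs f   (= d/dx k)
            C[M + i, M + j] = kDD(xi, xj)    # f' vs f'
    return C

C = joint_covariance(np.array([0.0, 1.0]))
```

The resulting matrix is symmetric and positive semidefinite, so it is a valid covariance for the enlarged GP that mixes function and derivative observations.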
22. Utilizing derivatives (without gradient vs. with gradient). Derivative observations can speed up Bayesian optimization.
37. Utilizing derivatives (without gradient vs. with gradient): the minimum has been found in both runs. Derivative observations can speed up Bayesian optimization.
38. Making Bayesian optimization time efficient. Solving $\arg\max_x \alpha_{EI}(x)$ can be very time consuming; Bayesian optimization runs inefficiently if the sample computation takes longer than the objective function calculation (simulation). We use differential evolution to maximize $\alpha_{EI}(x)$ and adapt the effort (i.e. the population size and number of generations) to the simulation time. We calculate one sample in advance while the objective function is evaluated. See Schneider et al., arXiv:1809.06674 (2019) for details.
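This step can be sketched with SciPy's differential evolution. `toy_predict` below is a hypothetical stand-in for a trained GP posterior, and the population size and iteration budget are the knobs one would adapt to the simulation time:

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import norm

def neg_expected_improvement(x, predict, y_min):
    """Negative EI; `predict` returns the GP posterior (mean, std) at x."""
    mean, std = predict(x)
    std = max(std, 1e-12)
    z = (y_min - mean) / std
    return -((y_min - mean) * norm.cdf(z) + std * norm.pdf(z))

def toy_predict(x):
    """Hypothetical stand-in for a trained GP posterior (illustration only)."""
    return float(np.sum(np.asarray(x) ** 2)), 0.5

bounds = [(-2.5, 2.5)] * 3
result = differential_evolution(
    neg_expected_improvement, bounds, args=(toy_predict, 1.0),
    popsize=15, maxiter=50, seed=0,   # effort adapted to the simulation time
)
# result.x is the next sample point, i.e. the point of largest expected improvement
```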
39. Benchmark. For the Rastrigin function we compare Bayesian optimization with other optimization methods.
40. Rastrigin function. Defined on an $n$-dimensional domain as $f(\mathbf{x}) = A n + \sum_{i=1}^{n} [x_i^2 - A \cos(2\pi x_i)]$ with $A = 10$. We use $n = 3$ and $x_i \in [-2.5, 2.5]$; the global minimum is $f_{min} = 0$ at $\mathbf{x} = 0$. We sleep for 10 s during each evaluation to make the function call "expensive", and run parallel minimization with 5 parallel evaluations of $f(\mathbf{x})$.
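The benchmark function can be reproduced directly; the 10-second sleep mimics an expensive simulation:

```python
import time
import numpy as np

A = 10.0

def rastrigin(x):
    """f(x) = A n + sum_i [x_i^2 - A cos(2 pi x_i)]; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return A * x.size + float(np.sum(x**2 - A * np.cos(2 * np.pi * x)))

def expensive_rastrigin(x, delay=10.0):
    """Black-box stand-in: sleeping makes each call 'expensive', as in the slides."""
    time.sleep(delay)   # simulates e.g. an expensive Maxwell solve
    return rastrigin(x)
```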
41. Choice of optimization algorithms. We compare the performance of Bayesian optimization (BO) with local optimization methods: gradient-based limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS-B), started in parallel from 10 different locations; and with global heuristic optimization: differential evolution (DE), particle swarm optimization (PSO), and covariance matrix adaptation evolution strategy (CMA-ES). All optimization methods are run with standard parameters.
42. Benchmark on the Rastrigin function [laptop with 2-core Intel Core i7 @ 2.7 GHz]. BO converges significantly faster than the other methods. Although more elaborate, BO has no significant computation-time overhead (total overhead approx. 3 min).
43. Benchmark on the Rastrigin function with derivatives [laptop with 2-core Intel Core i7 @ 2.7 GHz]. Derivative information speeds up the minimization. BO with and without derivatives finds lower function values than multi-start L-BFGS-B with derivatives.
44. Benchmark against open-source BO (scikit) [laptop with 2-core Intel Core i7 @ 2.7 GHz]. A comparison against the Bayesian optimization of scikit-optimize (https://scikit-optimize.github.io/stable/) shows that the implemented sample computation methods lead to better samples in a drastically reduced computation time.
45. More benchmarks. More benchmarks for realistic photonic optimization problems (single-photon source, metasurface, parameter reconstruction) can be found in the publication ACS Photonics 6, 2726 (2019), https://arxiv.org/abs/1809.06674.
46. Conclusion. Bayesian optimization is a highly efficient method for shape optimization; it can incorporate derivative information if available; and it can be used for very expensive simulations but also for fast/parallelized simulations (e.g. one simulation result every two seconds).
47. Acknowledgements. We are grateful to the following institutions for funding this research: the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 675745 (MSCA-ITN-EID NOLOSS); the EMPIR programme, co-financed by the Participating States and by the European Union's Horizon 2020 research and innovation programme, under grant agreement number 17FUN01 (Be-COMe); the Virtual Materials Design (VIRTMAT) project by the Helmholtz Association via the Helmholtz program Science and Technology of Nanosystems (STN); and the Central Innovation Programme for SMEs of the German Federal Ministry for Economic Affairs and Energy on the basis of a decision by the German Bundestag (ZF4450901).
48. Resources: description of the FEM software JCMsuite; getting started with JCMsuite; tutorial on optimization with JCMsuite using Matlab®/Python; free trial download of JCMsuite.