6. The optimization problem
Hyperparameter search--Bayesian Optimization
Given a set of hyperparameter configurations $X = \{x_1, x_2, \dots, x_n\}$, assume the hyperparameters and the model's optimized loss are related by a function $f: X \to \mathbb{R}$. We need to find, over $x \in X$,

$$x^* = \operatorname{argmin}_{x \in X} f(x)$$

At each iteration $t = 1, \dots, T$, choosing $x_t \in X$ yields $f(x_t)$, but in most cases only a noisy value $y_t = f(x_t) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, can be observed, which is added to the observation dataset $D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}$.
• f is explicitly unknown and multimodal.
• Evaluations of f may be perturbed.
• Evaluations of f are expensive.
How should we choose $x_t$?
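The sequential evaluation loop above can be sketched in a few lines; the objective `f`, the candidate set, and the noise level are illustrative assumptions (in a real search, `f(x)` would be the validation loss of a model trained with configuration `x`):

```python
import random

# Hypothetical black-box objective; in a real search f(x) is the
# validation loss of a model trained with hyperparameter setting x.
def f(x):
    return (x - 0.3) ** 2

random.seed(0)
X = [i / 10 for i in range(11)]    # candidate configurations x_1 ... x_n
sigma = 0.05                       # std of the observation noise eps
D = []                             # observation dataset D_{1:t}

for t in range(5):
    x_t = random.choice(X)                    # placeholder policy; BO's job is to pick this point
    y_t = f(x_t) + random.gauss(0.0, sigma)   # noisy observation y_t = f(x_t) + eps
    D.append((x_t, y_t))
```

Bayesian optimization replaces the random `x_t` choice with a decision based on the data collected in `D` so far.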
7. Hyperparameter search--Bayesian Optimization
General idea: surrogate modelling
1. Use a surrogate model of f to carry out the optimization.
2. Define a utility function to collect new data points
satisfying some optimality criterion: optimization as
decision.
3. Study decision problems as inference using the surrogate
model: use a probabilistic model able to calibrate both
epistemic and aleatoric uncertainty.
Bayesian optimization is a family of approximation methods: it fits the relationship between the hyperparameters and the model's evaluation score with a surrogate function (surrogate model), then iteratively selects promising hyperparameter configurations, and finally returns the best-performing configuration.
8. Hyperparameter search--Bayesian Optimization
Core components of BO
• Probabilistic surrogate model
Gaussian processes, random forests, Student-t processes, etc.
• Acquisition function
EI, MPI, LCB, KG, ES, etc.
tradeoff between improving on an already good point and
evaluating new points in under-explored areas
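As a toy sketch of the pieces named above, the snippet below uses a Gaussian-process surrogate with an RBF kernel and an LCB-style acquisition $\mu(x) - \kappa\sigma(x)$ (minimized, since we minimize $f$); the lengthscale, $\kappa$, and the test objective are assumptions, and a real implementation would fit the kernel hyperparameters and use a stable solver rather than a matrix inverse:

```python
import numpy as np

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ls^2))
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X_obs, y_obs, X_new, noise=1e-4):
    # Standard zero-mean GP regression equations.
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_obs, X_new)
    Kss = rbf(X_new, X_new)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y_obs
    var = np.diag(Kss - Ks.T @ Kinv @ Ks)
    return mu, np.sqrt(np.maximum(var, 0.0))

def f(x):                       # hypothetical expensive objective
    return np.sin(3 * x) + 0.5 * x

grid = np.linspace(0, 2, 200)   # candidate set X
X_obs = np.array([0.1, 1.9])    # initial design
y_obs = f(X_obs)

for t in range(10):
    mu, sd = gp_posterior(X_obs, y_obs, grid)
    lcb = mu - 2.0 * sd         # kappa = 2 balances exploitation (mu) vs exploration (sd)
    x_next = grid[np.argmin(lcb)]
    X_obs = np.append(X_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

best = X_obs[np.argmin(y_obs)]
```

Large `kappa` favors points with high posterior uncertainty (exploration); small `kappa` favors points with low predicted loss (exploitation), which is exactly the tradeoff noted above.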
11. HyperBand
Intuition:
• Compare relative performance
• Terminate bad performing trials
• Continue better trials for a longer period of time
Notes:
• Can be combined with Bayesian Optimization
• Can be easily parallelized
Hyperparameter search--HyperBand
Framework: SUCCESSIVE HALVING
• requires the number of configurations n and budget B
as an input
• tradeoff: n vs. B/n, the budget allocated to each configuration
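A minimal sketch of SUCCESSIVE HALVING under these two inputs; the `evaluate` function and its toy scoring rule are assumptions:

```python
import math
import random

def successive_halving(configs, budget, evaluate, eta=2):
    """Repeatedly evaluate all surviving configs, then keep the best 1/eta."""
    rungs = max(1, round(math.log(len(configs), eta)))
    per_rung = budget // rungs                    # split the total budget B across rungs
    survivors = list(configs)
    while len(survivors) > 1:
        b = per_rung // len(survivors)            # budget per config on this rung
        scored = [(evaluate(c, b), c) for c in survivors]
        scored.sort(key=lambda t: t[0], reverse=True)     # best first
        survivors = [c for _, c in scored[: max(1, len(survivors) // eta)]]
    return survivors[0]

# Toy example: the hypothetical score improves with budget, and the best
# configuration is the one whose learning rate is closest to 0.1.
random.seed(1)
configs = [{"lr": random.uniform(0.001, 1.0)} for _ in range(8)]

def evaluate(cfg, b):
    return -abs(cfg["lr"] - 0.1) + 0.001 * b

best = successive_halving(configs, budget=64, evaluate=evaluate)
```

HyperBand then runs several such brackets, each with a different n vs. B/n tradeoff, so that no single choice of n has to be made up front.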
15. Hyperparameter search--PBT
Exploit
Truncation Selection:
• Rank all agents in the population
• If the current agent is in the bottom 20% of the population,
sample another agent uniformly from the top 20% of the
population and copy its weights and hyperparameters
Binary Tournament / T-Test Selection
• Uniformly sample another agent in the population
• If the sampled agent has a higher score, the weights and
hyperparameters are copied to replace the current agent
Explore
Perturb
• Each hyperparameter is independently randomly
perturbed by a factor of 1.2 or 0.8
Resample
• With some probability, each hyperparameter is
resampled from the original prior distribution it
was defined with
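The exploit (truncation selection) and explore (perturb) steps above can be sketched as follows; the population layout, scores, and hyperparameter values are illustrative:

```python
import copy
import random

def exploit_truncation(agent, population):
    # Rank the population by score; if the agent is in the bottom 20%,
    # copy weights and hyperparameters from a uniformly sampled top-20% agent.
    ranked = sorted(population, key=lambda a: a["score"])
    cutoff = max(1, len(ranked) // 5)
    if agent in ranked[:cutoff]:
        donor = random.choice(ranked[-cutoff:])
        agent["weights"] = copy.deepcopy(donor["weights"])
        agent["hparams"] = dict(donor["hparams"])
    return agent

def explore_perturb(agent):
    # Independently perturb each hyperparameter by a factor of 1.2 or 0.8.
    for k in agent["hparams"]:
        agent["hparams"][k] *= random.choice([1.2, 0.8])
    return agent

random.seed(0)
population = [
    {"score": s, "weights": [0.0], "hparams": {"lr": 0.01 * (i + 1)}}
    for i, s in enumerate([0.1, 0.5, 0.9, 0.2, 0.7])
]
worst = min(population, key=lambda a: a["score"])
exploit_truncation(worst, population)   # worst agent adopts the best agent's state
explore_perturb(worst)                  # ...then mutates the copied hyperparameters
```

In a real PBT setup, each worker runs this exploit/explore pair periodically between stretches of ordinary training.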
16. Hyperparameter search--PBT
• the hyperparameters are clearly being
focused on the best part of the sampling
range, and adapted over time
• agents which are lucky in environment
exploration are quickly propagated to more
workers, meaning that all members of the
population benefit from the exploration luck
of the remainder of the population.
18. Population-based training
Main idea:
• Evaluate a population in parallel
• Terminate lowest performers
• Copy weights of the best performers and mutate hyperparameters
Benefits:
• Easily parallelizable
• Can search over ‘schedules’
• Terminates bad performers
Hyperparameter search
28. PopulationBasedTraining (PBT)
ray.tune.schedulers
• Implements the Population Based Training (PBT) algorithm.
• Trains a population of models in parallel.
• Poorly performing models periodically clone the state of the best-performing models, and random mutations are applied to their hyperparameters, in hopes of obtaining a better model.
• Unlike other hyperparameter search algorithms, PBT mutates hyperparameters during training time.
• If the number of trials exceeds the cluster capacity, they will be time-multiplexed so as to balance training progress.