6. The optimization problem
Hyperparameter search--Bayesian Optimization
Given a set of hyperparameter configurations $X = \{x_1, x_2, \dots, x_n\}$, assume the hyperparameters and the model's optimized loss are related by a function $f: X \to \mathbb{R}$. We need to find, over $x \in X$,

$$x^* = \operatorname{argmin}_{x \in X} f(x)$$

At each iteration $t = 1, \dots, T$, choosing $x_t \in X$ yields $f(x_t)$, but in most cases only a noisy value $y_t = f(x_t) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, can be observed, which is added to the observation dataset $D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}$.
• f is explicitly unknown and multimodal.
• Evaluations of f may be perturbed.
• Evaluations of f are expensive.
How should we choose $x_t$?
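The sequential evaluation loop above can be sketched in a few lines; the objective `f`, the candidate set, and the noise level are illustrative assumptions (in a real search, `f(x)` would be the validation loss of a model trained with configuration `x`):

```python
import random

# Hypothetical black-box objective; in a real search f(x) is the
# validation loss of a model trained with hyperparameter setting x.
def f(x):
    return (x - 0.3) ** 2

random.seed(0)
X = [i / 10 for i in range(11)]    # candidate configurations x_1 ... x_n
sigma = 0.05                       # std of the observation noise eps
D = []                             # observation dataset D_{1:t}

for t in range(5):
    x_t = random.choice(X)                    # placeholder policy; BO's job is to pick this point
    y_t = f(x_t) + random.gauss(0.0, sigma)   # noisy observation y_t = f(x_t) + eps
    D.append((x_t, y_t))
```

Bayesian optimization replaces the random `x_t` choice with a decision based on the data collected in `D` so far.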
7. Hyperparameter search--Bayesian Optimization
General idea: surrogate modelling
1. Use a surrogate model of f to carry out the optimization.
2. Define a utility function to collect new data points
satisfying some optimality criterion: optimization as
decision.
3. Study decision problems as inference using the surrogate
model: use a probabilistic model able to calibrate both
epistemic and aleatoric uncertainty.
Bayesian optimization is a family of approximation methods: it fits the relationship between the hyperparameters and the model's evaluation score with a surrogate function (surrogate model), then iteratively selects promising hyperparameter configurations, and finally returns the best-performing configuration.
8. Hyperparameter search--Bayesian Optimization
Core components of BO
• Probabilistic surrogate model
Gaussian processes, random forests, Student-t processes, etc.
• Acquisition function
EI, MPI, LCB, KG, ES, etc.
tradeoff between improving on an already good point and
evaluating new points in under-explored areas
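As a toy sketch of the pieces named above, the snippet below uses a Gaussian-process surrogate with an RBF kernel and an LCB-style acquisition $\mu(x) - \kappa\sigma(x)$ (minimized, since we minimize $f$); the lengthscale, $\kappa$, and the test objective are assumptions, and a real implementation would fit the kernel hyperparameters and use a stable solver rather than a matrix inverse:

```python
import numpy as np

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ls^2))
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X_obs, y_obs, X_new, noise=1e-4):
    # Standard zero-mean GP regression equations.
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_obs, X_new)
    Kss = rbf(X_new, X_new)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y_obs
    var = np.diag(Kss - Ks.T @ Kinv @ Ks)
    return mu, np.sqrt(np.maximum(var, 0.0))

def f(x):                       # hypothetical expensive objective
    return np.sin(3 * x) + 0.5 * x

grid = np.linspace(0, 2, 200)   # candidate set X
X_obs = np.array([0.1, 1.9])    # initial design
y_obs = f(X_obs)

for t in range(10):
    mu, sd = gp_posterior(X_obs, y_obs, grid)
    lcb = mu - 2.0 * sd         # kappa = 2 balances exploitation (mu) vs exploration (sd)
    x_next = grid[np.argmin(lcb)]
    X_obs = np.append(X_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

best = X_obs[np.argmin(y_obs)]
```

Large `kappa` favors points with high posterior uncertainty (exploration); small `kappa` favors points with low predicted loss (exploitation), which is exactly the tradeoff noted above.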
11. HyperBand
Intuition:
• Compare relative performance
• Terminate bad performing trials
• Continue better trials for a longer period of time
Notes:
• Can be combined with Bayesian Optimization
• Can be easily parallelized
Hyperparameter search--HyperBand
Framework: SUCCESSIVE HALVING
• requires the number of configurations n and budget B
as an input
• tradeoff: n vs. B/n, the budget allocated to each configuration
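A minimal sketch of SUCCESSIVE HALVING under these two inputs; the `evaluate` function and its toy scoring rule are assumptions:

```python
import math
import random

def successive_halving(configs, budget, evaluate, eta=2):
    """Repeatedly evaluate all surviving configs, then keep the best 1/eta."""
    rungs = max(1, round(math.log(len(configs), eta)))
    per_rung = budget // rungs                    # split the total budget B across rungs
    survivors = list(configs)
    while len(survivors) > 1:
        b = per_rung // len(survivors)            # budget per config on this rung
        scored = [(evaluate(c, b), c) for c in survivors]
        scored.sort(key=lambda t: t[0], reverse=True)     # best first
        survivors = [c for _, c in scored[: max(1, len(survivors) // eta)]]
    return survivors[0]

# Toy example: the hypothetical score improves with budget, and the best
# configuration is the one whose learning rate is closest to 0.1.
random.seed(1)
configs = [{"lr": random.uniform(0.001, 1.0)} for _ in range(8)]

def evaluate(cfg, b):
    return -abs(cfg["lr"] - 0.1) + 0.001 * b

best = successive_halving(configs, budget=64, evaluate=evaluate)
```

HyperBand then runs several such brackets, each with a different n vs. B/n tradeoff, so that no single choice of n has to be made up front.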
15. Hyperparameter search--PBT
Exploit
Truncation Selection:
• Rank all agents in the population
• If the current agent is in the bottom 20% of the population,
sample another agent uniformly from the top 20% of the
population and copy its weights and hyperparameters
Binary Tournament / T-Test Selection
• Uniformly sample another agent in the population
• If the sampled agent has a higher score, the weights and
hyperparameters are copied to replace the current agent
Explore
Perturb
• Each hyperparameter is independently randomly
perturbed by a factor of 1.2 or 0.8
Resample
• With some probability, each hyperparameter is
resampled from the original prior distribution it
was defined with
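The exploit (truncation selection) and explore (perturb) steps above can be sketched as follows; the population layout, scores, and hyperparameter values are illustrative:

```python
import copy
import random

def exploit_truncation(agent, population):
    # Rank the population by score; if the agent is in the bottom 20%,
    # copy weights and hyperparameters from a uniformly sampled top-20% agent.
    ranked = sorted(population, key=lambda a: a["score"])
    cutoff = max(1, len(ranked) // 5)
    if agent in ranked[:cutoff]:
        donor = random.choice(ranked[-cutoff:])
        agent["weights"] = copy.deepcopy(donor["weights"])
        agent["hparams"] = dict(donor["hparams"])
    return agent

def explore_perturb(agent):
    # Independently perturb each hyperparameter by a factor of 1.2 or 0.8.
    for k in agent["hparams"]:
        agent["hparams"][k] *= random.choice([1.2, 0.8])
    return agent

random.seed(0)
population = [
    {"score": s, "weights": [0.0], "hparams": {"lr": 0.01 * (i + 1)}}
    for i, s in enumerate([0.1, 0.5, 0.9, 0.2, 0.7])
]
worst = min(population, key=lambda a: a["score"])
exploit_truncation(worst, population)   # worst agent adopts the best agent's state
explore_perturb(worst)                  # ...then mutates the copied hyperparameters
```

In a real PBT setup, each worker runs this exploit/explore pair periodically between stretches of ordinary training.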
16. Hyperparameter search--PBT
• the hyperparameters are clearly being
focused on the best part of the sampling
range, and adapted over time
• agents which are lucky in environment
exploration are quickly propagated to more
workers, meaning that all members of the
population benefit from the exploration luck
of the remainder of the population.
18. Population-based training
Main idea:
• Evaluate a population in parallel
• Terminate lowest performers
• Copy weights of the best performers and mutate hyperparameters
Benefits:
• Easily parallelizable
• Can search over ‘schedules’
• Terminates bad performers
Hyperparameter search
28. PopulationBasedTraining (PBT)
ray.tune.schedulers
• Implements the Population Based Training (PBT) algorithm.
• Trains a population of models in parallel.
• Poorly performing models periodically clone the state of the best-performing models, and random mutations are applied to their hyperparameters, in hopes of obtaining a better model.
• Unlike other hyperparameter search algorithms, PBT mutates hyperparameters during training time.
• If the number of trials exceeds the cluster capacity, they will be time-multiplexed so as to balance training progress.