High-Dimensional Bayesian Optimization with Constraints:
Application to Powder Weighing
Shoki Miyagawa, Atsuyoshi Yano, Naoko Sawada, Isamu Ogawa
(Mitsubishi Electric Corporation)
Background
Bayesian optimization (BO) can find optimal parameters for black-box problems within a limited number of trials.
2/32
[Figure: the BO loop. Bayesian optimization proposes new input parameters x^new to the black-box model, which returns the output y^new (to be maximized); the observation is fed back to BO.]
Background
Bayesian optimization (BO) can find optimal parameters for black-box problems within a limited number of trials.
However, it does not work well for high-dimensional parameters (typically more than 10 dimensions) because the exploration area is too wide.
3/32
Background
Related works explore parameters in a low-dimensional space obtained by one of the following methods.
4/32
Subspace extraction / Linear embedding
• Dropout [IJCAI'17]: BO in the non-dropped dimensions (original vs. dropped parameters).
• LINEBO [ICML'19]: BO in a single dimension (a line through x^init, yielding x^new).
• REMBO [IJCAI'13]: BO in randomly embedded dimensions (x = Az with a random matrix A).
〇 Constraints can be easily introduced.
× Exploration is not efficient (particularly for image and NLP data).

Nonlinear embedding
1. Encode datasets with a DNN encoder (x₁, x₂, x₃ → z₁, z₂, z₃). 2. BO in the latent space yields z^new. 3. Decode the latent parameters with a DNN decoder (z^new → x^new).
〇 Efficient exploration.
× Constraints cannot be explicitly expressed in the latent space. ← We tackle this problem!!
Proposed method: Key idea
This study focuses on two types of constraints.
• Known equality constraint → Decomposition into variable and fixed parameters is useful.
• Unknown inequality constraint → Introducing disentangled representation learning (DRL) into the nonlinear embedding is useful.
5/32
[Figure: nonlinear embedding pipeline. 1. Encode datasets (encoder DNN: x₁, x₂, x₃ → z₁, z₂, z₃); 2. Bayesian optimization in the latent space yields z^new; 3. Decode the latent parameters (decoder DNN: z^new → x^new).]
Disentangled representation learning
• Latent parameters are interpretable and independent: each latent axis controls a single feature such as "rotation" or "smile" (the figure is from β-VAE [ICLR'17]).
• DRL is generally used to control generative models (VAE, GAN, …).
[Figure: parameter decomposition. The parameters x are split into variable parameters x_v (w/o equality constraints), which Bayesian optimization explores to produce x_v^new, and fixed parameters x_f (w/ equality constraints), which are set directly to the conditioned values. A minimal sketch of this decomposition follows.]
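As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the decomposition, assuming the index set of the equality-constrained entries is known; the function names and layout are hypothetical.

```python
import numpy as np

def decompose(x, fixed_idx):
    """Split a parameter vector into the variable part x_v and the
    fixed part x_f (the entries bound by known equality constraints)."""
    mask = np.zeros(len(x), dtype=bool)
    mask[fixed_idx] = True
    return x[~mask], x[mask]

def recompose(x_v_new, x_f_target, fixed_idx, dim):
    """Reassemble the full vector: BO explores only x_v, while x_f is
    pinned to the values required by the equality constraints."""
    x = np.empty(dim)
    mask = np.zeros(dim, dtype=bool)
    mask[fixed_idx] = True
    x[~mask] = x_v_new
    x[mask] = x_f_target
    return x
```

BO then runs only over the x_v block, and recompose is applied before each trial on the real system.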
Proposed method: Key idea
• Unknown inequality constraint
8/32
Problem in the previous methods
[Figure: nonlinear embedding maps the original parameter space (axis #1 × axis #2) to the latent parameter space; the region that satisfies the constraints and the region that does not become mixed in the latent space.]
If the inequality constraints are known, we can simply apply BO within the region that satisfies them. When they are unknown, BO may generate parameters that do not satisfy the constraints.
Proposed method: Key idea
• Unknown inequality constraint → Introducing disentangled representation learning (DRL) into the nonlinear embedding is useful, because users need only check whether the constraints are satisfied for the data along each axis.
9/32
Example 1: Generating a face under the constraint that the face is male.
[Figure: the decoder traverses latent axis #1 ("rotation"); this axis is not related to the constraint (being a male face), so the region along it locally satisfies the constraint.]
Proposed method: Key idea
10/32
Example 1 (continued): [Figure: latent axis #2 ("smiling") is likewise not related to the constraint (being a male face).]
Proposed method: Key idea
11/32
Example 1 (continued): [Figure: mixed features, i.e., combinations of axis #1 and axis #2, also satisfy the constraint.]
Proposed method: Key idea
12/32
Example 2: Generating a face under the constraint that the face is smiling.
[Figure: latent axis #2 ("smiling") is related to the constraint, so the region along it possibly does not satisfy the constraint; the exploration area is therefore restricted to axis #1.]
Proposed method: Key idea
13/32
[Figure: exploration areas in the latent space: the full axis #1 × axis #2 plane in Example 1, but only axis #1 in Example 2.]
We can thus control the exploration area in the latent space even if the inequality constraints are unknown.
Proposed method: Overview
14/32
[Figure: overview of the proposed pipeline.]
Step 1: Dimensionality reduction
For the variable parameters, we used β-VAE to introduce DRL into a VAE and acquired the latent space z_v ∈ ℝ^{d_v}. (For the fixed parameters, we used PCA for simplicity.)
15/32
β-VAE loss = reconstruction loss + β × KL-divergence loss
[Figure: the encoder maps x to z; the KL-divergence loss pulls the latent distribution toward 𝒩(0, 1); the decoder reconstructs x′, which is compared to x by the reconstruction loss.]
The hyperparameters (the dimensionality of the latent space d_v and the coefficient β) control a tradeoff between the two losses:
• Large β: × BO generates coarse-grained features (→ hard to optimize parameters); 〇 features are more disentangled.
• Small β: 〇 BO generates fine-grained features; × features are less disentangled (→ hard to consider constraints).
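For reference, a minimal PyTorch-style sketch of the β-VAE objective (not the authors' exact implementation; the mean-squared-error reconstruction term and the diagonal-Gaussian posterior parameterized by mu and logvar are assumptions):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta):
    """beta-VAE objective: reconstruction + beta * KL(q(z|x) || N(0, I)).
    Larger beta trades reconstruction fidelity for disentanglement."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

Sweeping beta in this loss is exactly the knob described above: larger values make the features more disentangled but coarser.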
Step 2: Bayesian optimization
We used Gaussian process regression (GPR) and maximized the UCB (upper confidence bound) acquisition function a(z_v, z_f = z_f^target):
a_UCB(z_v, z_f) = μ(z_v, z_f) + α · σ(z_v, z_f),
where μ is the GPR predictive mean, σ is the predictive standard deviation, and α weights exploration.
16/32
[Figure: GPR fit of y over (z_v, z_f) and the resulting acquisition function a_UCB(z_v, z_f), maximized at (z_v^new, z_f^target).]
We generated three candidate parameters z_v^new and let the user select one of them:
- exploitation-oriented (α = 0.001)
- intermediate (α = 0.5)
- exploration-oriented (α = 1.0)
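A minimal sketch of this step with scikit-learn (an assumption: the slides do not name the GP library, the kernel, or how the acquisition function is maximized; here it is maximized over a hypothetical candidate grid z_v_grid):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def propose_candidates(Z_train, y_train, z_f_target, z_v_grid):
    """Fit GPR on past observations ((z_v, z_f) -> y) and return one
    z_v^new per exploration weight alpha, as offered to the user."""
    gpr = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gpr.fit(Z_train, y_train)
    # Evaluate a_UCB(z_v, z_f = z_f^target) on the candidate grid.
    Z_cand = np.hstack([z_v_grid, np.tile(z_f_target, (len(z_v_grid), 1))])
    mu, sigma = gpr.predict(Z_cand, return_std=True)
    return {alpha: z_v_grid[np.argmax(mu + alpha * sigma)]
            for alpha in (0.001, 0.5, 1.0)}
```

propose_candidates returns one z_v^new per α, matching the three options (exploitation / intermediate / exploration) shown to the user.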
Usage Scenario: Powder weighing system
System overview
The system must weigh a powder precisely by switching the valve opening degree v_i → v_{i+1} whenever the scale value reaches the corresponding switching weight s_{i+1} (0 ≤ i ≤ 8).
17/32
[Figure: weighing profile over the 9 steps. The valve opening degree starts at v_0 (start) and ends at v_9 (end), switching at the weights s_1, …, s_9.]
Usage Scenario: Powder weighing system
Two types of inequality constraints
• Non-negativity constraints: v_i > 0, s_i > 0
• Monotonicity constraints: the valve opening degrees decrease (v_i > v_{i+1}) while the switching weights increase (s_i < s_{i+1})
18/32
[Figure: the same weighing profile as on the previous slide.]
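These two constraints are easy to state in code. A minimal NumPy check (the function name and array layout are hypothetical), reused in Experiment 1-1 below:

```python
import numpy as np

def satisfies_constraints(v, s):
    """Check the inequality constraints on a weighing recipe.
    v: valve opening degrees v_0..v_9; s: switching weights s_1..s_9."""
    nonneg = np.all(v > 0) and np.all(s > 0)  # v_i > 0, s_i > 0
    # v strictly decreasing, s strictly increasing across the steps.
    monotone = np.all(np.diff(v) < 0) and np.all(np.diff(s) > 0)
    return bool(nonneg and monotone)
```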
Usage Scenario: Powder weighing system
Preprocessing
19/32
Two preprocessing pipelines are applied:
• Pipeline A: normalization; outlier removal; duplication removal to prevent imbalanced learning; train/test split.
• Pipeline B: normalization; outlier removal; data filtering to restrict the exploration area locally; train/test split.
Usage Scenario: Powder weighing system
Datasets
20/32
The dataset contained 60 types of powder and consisted of 1,792 trials (31.33 ± 19.48 trials per powder on average).
• Parameters x_f w/ equality constraints (used for learning PCA and GPR)
• Parameters x_v w/o equality constraints (used for learning β-VAE and GPR)
• Objective value y, representing the error between the measured and required weight (used for learning GPR)
Experiments overview
21/32
Experiments 1-1 and 1-2: We verify the effect of the β-VAE hyperparameters (d_v and β) on considering the inequality constraints.
Experiment 2: We verify whether the proposed method can determine optimum parameters within a reasonable number of trials, i.e., until the weighing error y is less than 1% of the required weight. (Manual tuning typically needs about 20 trials in practice.)
Experiment 1-1: Quantitative evaluation of hyperparameter effects
23/32
Hyperparameters: d_v ∈ {2, 4, 6, 8, 10}, β ∈ {0.1, 0.2, …, 1.5}
Procedure (× 75, once per hyperparameter combination):
1. Hyperparameter value selection
2. β-VAE learning
3. Evaluation: sample randomly in the latent space ℝ^{d_v} (n = 1000), decode each z to x with the decoder (DNN), and check whether x satisfies the constraints (suitable / unsuitable). The metric is the number of unsuitable data, i.e., decoded samples that do not satisfy the constraints. (As a reference, data sampled randomly in the original space, n = 1000, are checked the same way.)
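A sketch of this evaluation loop, reusing the satisfies_constraints helper from above (the standard-normal latent sampling and the v/s layout of the decoded vector are assumptions; the slides only say "sampled randomly"):

```python
import numpy as np

def count_unsuitable(decoder, d_v, n=1000, seed=0):
    """Experiment 1-1 metric: sample n latent points, decode them, and
    count how many decoded parameters violate the inequality constraints."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d_v))  # random samples in the latent space
    unsuitable = 0
    for x in decoder(z):               # decode z -> x (original parameters)
        v, s = x[:10], x[10:19]        # hypothetical layout: v_0..v_9, s_1..s_9
        if not satisfies_constraints(v, s):
            unsuitable += 1
    return unsuitable
```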
Experiment 1-1: Quantitative evaluation of hyperparameter effects
Result
24/32
[Figure: the number of unsuitable data as a function of β and d_v, together with a schematic of the latent space showing the exploration area around the origin and the outer regions where undesirable features are emphasized.]
Findings
• Larger β decreases the number of unsuitable data.
→ We conjecture that DRL enables us to consider the inequality constraints.
• Larger latent dimensionality d_v increases the number of unsuitable data.
→ We conjecture that samples far from the origin of the latent space tend to be unsuitable, because fine-grained features are emphasized in the area far from the origin.
Experiment 1-2: Qualitative evaluation of hyperparameter effects
25/32
Hyperparameters: d_v = 2, β ∈ {0.1, 0.5, 1.0}
Procedure (× 3, once per hyperparameter combination):
1. Hyperparameter value selection
2. β-VAE learning
3. Visualization: sample at equal intervals along the axes in the latent space (n = 15), decode each z to x with the decoder (DNN), and check whether the disentangled features satisfy the constraints; this reveals the meaning of each disentangled feature. (For comparison, data sampled at equal intervals along the axes in the original space, n = 15, are visualized as well.)
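A sketch of the axis traversal used for this visualization (the traversal range span is an assumption; the slides specify only 15 equally spaced points per axis):

```python
import numpy as np

def axis_traversal(decoder, d_v=2, n=15, span=3.0):
    """Experiment 1-2 visualization: decode n equally spaced points along
    each latent axis (the other coordinates held at 0) to see which
    feature of the weighing profile each disentangled axis controls."""
    profiles = []
    for axis in range(d_v):
        z = np.zeros((n, d_v))
        z[:, axis] = np.linspace(-span, span, n)  # equal intervals on one axis
        profiles.append(decoder(z))
    return profiles  # one array of decoded parameter vectors per axis
```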
Experiment 1-2: Qualitative evaluation of hyperparameter effects
Result
26/32
[Figure: decoded weighing profiles (valve opening degree vs. switching weight) along each latent axis for the three β values; one axis changes the initial point, the other changes both the initial point and the gradient.]
• Large β: sufficient consideration of the constraints, but poor diversity.
• Small β: lack of consideration of the constraints, but rich diversity.
Experiment 1-2: Qualitative evaluation of hyperparameter effects
Result (continued)
27/32
We used the rich-diversity setting (small β) in the next experiment, Experiment 2.
Experiment 1: Discussion
28/32
• Can DRL consider inequality constraints?
➢ Yes.
• How should we set the hyperparameter values?
➢ Visualizing the effects quantitatively (Experiment 1-1) and qualitatively (Experiment 1-2) is helpful for determining the values.
➢ We recommend determining the value of d_v first, because the suitable value of β depends on the value of d_v.
[Figure: map of acceptable hyperparameter settings. Toward smaller d_v and larger β, the reconstruction loss is too high (→ parameters have poor diversity); toward larger d_v and smaller β, the constraints are insufficiently considered; the acceptable parameter area lies in between.]
Experiment 2
30/32
Procedure
• We used three types of powder (A, B, and C) not included in the dataset; the PCA visualization of the fixed parameters shows that powders A, B, and C are not outliers.
• Based on the results of Experiment 1, we set d_v = 2 and β = 0.1, which leads to rich diversity and low reconstruction error.
Result
• The proposed method reduces the number of trials from about 20 (baseline: manual tuning) to around 5.
Features of the generated parameters
• For powders B and C, the constraints were satisfied in all trials.
• For powder A, unsuitable data were generated in one trial, presumably because the method explored areas far from the origin of the latent space (consistent with the observation in Experiment 1).
Limitations
31/32
• The relationship between the hyperparameters (d_v and β) and the number of required trials is still unclear: Experiment 1 linked the hyperparameters to constraint satisfaction, and Experiment 2 linked constraint satisfaction to the number of required trials, but the direct link remains open.
• The exploration area must be set manually.
Future work: finding the best size of the exploration area (bounding box) in the latent space. If it is too small, the optimal parameters cannot be explored; if it is too large, BO generates parameters that do not satisfy the constraints.
Conclusion
• We proposed methods to handle two types of constraints in Bayesian optimization even after nonlinear embedding.
➢ Known equality constraints: parameter decomposition is useful.
➢ Unknown inequality constraints: disentangled representation learning is useful.
• We conducted two experiments.
➢ Experiment 1 showed the effect of the hyperparameters on considering the inequality constraints, and the visualizations that help determine their values.
➢ Experiment 2 demonstrated that the proposed method reduces the number of trials by approximately 66% compared to manual tuning.
32/32
Do you have any questions?
33/32