High-Dimensional Bayesian Optimization with Constraints:
Application to Powder Weighing
Shoki Miyagawa, Atsuyoshi Yano, Naoko Sawada, Isamu Ogawa
(Mitsubishi Electric Corporation)
Background
Bayesian optimization (BO) can find optimal parameters for black-box problems within a limited number of trials.
2/32
[Figure: the BO loop. Bayesian optimization proposes new input parameters x^new to the black-box model, which returns the output y^new (to be maximized); the observation is fed back to BO.]
Background
Bayesian optimization (BO) can find optimal parameters for black-box problems within a limited number of trials.
However, it does not work well for high-dimensional parameters (typically more than 10 dimensions) because the exploration area is too wide.
3/32
Background
Related works explore parameters in a low-dimensional space obtained by one of the following methods.
4/32
Subspace extraction / Linear embedding
• Dropout [IJCAI'17]: BO in the non-dropped dimensions (original vs. dropped parameters).
• LINEBO [ICML'19]: BO in a single dimension (a line through x^init, yielding x^new).
• REMBO [IJCAI'13]: BO in randomly embedded dimensions (x = Az with a random matrix A).
〇 Constraints can be easily introduced.
× Exploration is not efficient (particularly for image and NLP data).

Nonlinear embedding
1. Encode datasets with a DNN encoder (x₁, x₂, x₃ → z₁, z₂, z₃). 2. BO in the latent space yields z^new. 3. Decode the latent parameters with a DNN decoder (z^new → x^new).
〇 Efficient exploration.
× Constraints cannot be explicitly expressed in the latent space. ← We tackle this problem!!
Proposed method: Key idea
This study focuses on two types of constraints.
• Known equality constraint → Decomposition into variable and fixed parameters is useful.
• Unknown inequality constraint → Introducing disentangled representation learning (DRL) into the nonlinear embedding is useful.
5/32
[Figure: nonlinear embedding pipeline. 1. Encode datasets (encoder DNN: x₁, x₂, x₃ → z₁, z₂, z₃); 2. Bayesian optimization in the latent space yields z^new; 3. Decode the latent parameters (decoder DNN: z^new → x^new).]
Disentangled representation learning
• Latent parameters are interpretable and independent: each latent axis controls a single feature such as "rotation" or "smile" (the figure is from β-VAE [ICLR'17]).
• DRL is generally used to control generative models (VAE, GAN, …).
[Figure: parameter decomposition. The parameters x are split into variable parameters x_v (w/o equality constraints), which Bayesian optimization explores to produce x_v^new, and fixed parameters x_f (w/ equality constraints), which are set directly to the conditioned values. A minimal sketch of this decomposition follows.]
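As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the decomposition, assuming the index set of the equality-constrained entries is known; the function names and layout are hypothetical.

```python
import numpy as np

def decompose(x, fixed_idx):
    """Split a parameter vector into the variable part x_v and the
    fixed part x_f (the entries bound by known equality constraints)."""
    mask = np.zeros(len(x), dtype=bool)
    mask[fixed_idx] = True
    return x[~mask], x[mask]

def recompose(x_v_new, x_f_target, fixed_idx, dim):
    """Reassemble the full vector: BO explores only x_v, while x_f is
    pinned to the values required by the equality constraints."""
    x = np.empty(dim)
    mask = np.zeros(dim, dtype=bool)
    mask[fixed_idx] = True
    x[~mask] = x_v_new
    x[mask] = x_f_target
    return x
```

BO then runs only over the x_v block, and recompose is applied before each trial on the real system.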
Proposed method: Key idea
• Unknown inequality constraint
8/32
Problem in the previous methods
[Figure: nonlinear embedding maps the original parameter space (axis #1 × axis #2) to the latent parameter space; the region that satisfies the constraints and the region that does not become mixed in the latent space.]
If the inequality constraints are known, we can simply apply BO within the region that satisfies them. When they are unknown, BO may generate parameters that do not satisfy the constraints.
Proposed method: Key idea
• Unknown inequality constraint → Introducing disentangled representation learning (DRL) into the nonlinear embedding is useful, because users need only check whether the constraints are satisfied for the data along each axis.
9/32
Example 1: Generating a face under the constraint that the face is male.
[Figure: the decoder traverses latent axis #1 ("rotation"); this axis is not related to the constraint (being a male face), so the region along it locally satisfies the constraint.]
Proposed method: Key idea
10/32
Example 1 (continued): [Figure: latent axis #2 ("smiling") is likewise not related to the constraint (being a male face).]
Proposed method: Key idea
11/32
Example 1 (continued): [Figure: mixed features, i.e., combinations of axis #1 and axis #2, also satisfy the constraint.]
Proposed method: Key idea
12/32
Example 2: Generating a face under the constraint that the face is smiling.
[Figure: latent axis #2 ("smiling") is related to the constraint, so the region along it possibly does not satisfy the constraint; the exploration area is therefore restricted to axis #1.]
Proposed method: Key idea
13/32
[Figure: exploration areas in the latent space: the full axis #1 × axis #2 plane in Example 1, but only axis #1 in Example 2.]
We can thus control the exploration area in the latent space even if the inequality constraints are unknown.
Proposed method: Overview
14/32
[Figure: overview of the proposed pipeline.]
Step 1: Dimensionality reduction
For the variable parameters, we used β-VAE to introduce DRL into a VAE and acquired the latent space z_v ∈ ℝ^{d_v}. (For the fixed parameters, we used PCA for simplicity.)
15/32
β-VAE loss = reconstruction loss + β × KL-divergence loss
[Figure: the encoder maps x to z; the KL-divergence loss pulls the latent distribution toward 𝒩(0, 1); the decoder reconstructs x′, which is compared to x by the reconstruction loss.]
The hyperparameters (the dimensionality of the latent space d_v and the coefficient β) control a tradeoff between the two losses:
• Large β: × BO generates coarse-grained features (→ hard to optimize parameters); 〇 features are more disentangled.
• Small β: 〇 BO generates fine-grained features; × features are less disentangled (→ hard to consider constraints).
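For reference, a minimal PyTorch-style sketch of the β-VAE objective (not the authors' exact implementation; the mean-squared-error reconstruction term and the diagonal-Gaussian posterior parameterized by mu and logvar are assumptions):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta):
    """beta-VAE objective: reconstruction + beta * KL(q(z|x) || N(0, I)).
    Larger beta trades reconstruction fidelity for disentanglement."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

Sweeping beta in this loss is exactly the knob described above: larger values make the features more disentangled but coarser.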
Step 2: Bayesian optimization
We used Gaussian process regression (GPR) and maximized the UCB (upper confidence bound) acquisition function a(z_v, z_f = z_f^target):
a_UCB(z_v, z_f) = μ(z_v, z_f) + α · σ(z_v, z_f),
where μ is the GPR predictive mean, σ is the predictive standard deviation, and α weights exploration.
16/32
[Figure: GPR fit of y over (z_v, z_f) and the resulting acquisition function a_UCB(z_v, z_f), maximized at (z_v^new, z_f^target).]
We generated three candidate parameters z_v^new and let the user select one of them:
- exploitation-oriented (α = 0.001)
- intermediate (α = 0.5)
- exploration-oriented (α = 1.0)
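A minimal sketch of this step with scikit-learn (an assumption: the slides do not name the GP library, the kernel, or how the acquisition function is maximized; here it is maximized over a hypothetical candidate grid z_v_grid):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def propose_candidates(Z_train, y_train, z_f_target, z_v_grid):
    """Fit GPR on past observations ((z_v, z_f) -> y) and return one
    z_v^new per exploration weight alpha, as offered to the user."""
    gpr = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gpr.fit(Z_train, y_train)
    # Evaluate a_UCB(z_v, z_f = z_f^target) on the candidate grid.
    Z_cand = np.hstack([z_v_grid, np.tile(z_f_target, (len(z_v_grid), 1))])
    mu, sigma = gpr.predict(Z_cand, return_std=True)
    return {alpha: z_v_grid[np.argmax(mu + alpha * sigma)]
            for alpha in (0.001, 0.5, 1.0)}
```

propose_candidates returns one z_v^new per α, matching the three options (exploitation / intermediate / exploration) shown to the user.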
Usage Scenario: Powder weighing system
System overview
The system must weigh a powder precisely by switching the valve opening degree v_i → v_{i+1} whenever the scale value reaches the corresponding switching weight s_{i+1} (0 ≤ i ≤ 8).
17/32
[Figure: weighing profile over the 9 steps. The valve opening degree starts at v_0 (start) and ends at v_9 (end), switching at the weights s_1, …, s_9.]
Usage Scenario: Powder weighing system
Two types of inequality constraints
• Non-negativity constraints: v_i > 0, s_i > 0
• Monotonicity constraints: the valve opening degrees decrease (v_i > v_{i+1}) while the switching weights increase (s_i < s_{i+1})
18/32
[Figure: the same weighing profile as on the previous slide.]
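These two constraints are easy to state in code. A minimal NumPy check (the function name and array layout are hypothetical), reused in Experiment 1-1 below:

```python
import numpy as np

def satisfies_constraints(v, s):
    """Check the inequality constraints on a weighing recipe.
    v: valve opening degrees v_0..v_9; s: switching weights s_1..s_9."""
    nonneg = np.all(v > 0) and np.all(s > 0)  # v_i > 0, s_i > 0
    # v strictly decreasing, s strictly increasing across the steps.
    monotone = np.all(np.diff(v) < 0) and np.all(np.diff(s) > 0)
    return bool(nonneg and monotone)
```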
Usage Scenario: Powder weighing system
Preprocessing
19/32
Two preprocessing pipelines are applied:
• Pipeline A: normalization; outlier removal; duplication removal to prevent imbalanced learning; train/test split.
• Pipeline B: normalization; outlier removal; data filtering to restrict the exploration area locally; train/test split.
Usage Scenario: Powder weighing system
Datasets
20/32
The dataset contained 60 types of powder and consisted of 1,792 trials (31.33 ± 19.48 trials per powder on average).
• Parameters x_f w/ equality constraints (used for learning PCA and GPR)
• Parameters x_v w/o equality constraints (used for learning β-VAE and GPR)
• Objective value y, representing the error between the measured and required weight (used for learning GPR)
Experiments overview
21/32
Experiments 1-1 and 1-2: We verify the effect of the β-VAE hyperparameters (d_v and β) on considering the inequality constraints.
Experiment 2: We verify whether the proposed method can determine optimum parameters within a reasonable number of trials, i.e., until the weighing error y is less than 1% of the required weight. (Manual tuning typically needs about 20 trials in practice.)
Experiment 1-1: Quantitative evaluation of hyperparameter effects
23/32
Hyperparameters: d_v ∈ {2, 4, 6, 8, 10}, β ∈ {0.1, 0.2, …, 1.5}
Procedure (× 75, once per hyperparameter combination):
1. Hyperparameter value selection
2. β-VAE learning
3. Evaluation: sample randomly in the latent space ℝ^{d_v} (n = 1000), decode each z to x with the decoder (DNN), and check whether x satisfies the constraints (suitable / unsuitable). The metric is the number of unsuitable data, i.e., decoded samples that do not satisfy the constraints. (As a reference, data sampled randomly in the original space, n = 1000, are checked the same way.)
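A sketch of this evaluation loop, reusing the satisfies_constraints helper from above (the standard-normal latent sampling and the v/s layout of the decoded vector are assumptions; the slides only say "sampled randomly"):

```python
import numpy as np

def count_unsuitable(decoder, d_v, n=1000, seed=0):
    """Experiment 1-1 metric: sample n latent points, decode them, and
    count how many decoded parameters violate the inequality constraints."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d_v))  # random samples in the latent space
    unsuitable = 0
    for x in decoder(z):               # decode z -> x (original parameters)
        v, s = x[:10], x[10:19]        # hypothetical layout: v_0..v_9, s_1..s_9
        if not satisfies_constraints(v, s):
            unsuitable += 1
    return unsuitable
```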
Experiment 1-1: Quantitative evaluation of hyperparameter effects
Result
24/32
[Figure: the number of unsuitable data as a function of β and d_v, together with a schematic of the latent space showing the exploration area around the origin and the outer regions where undesirable features are emphasized.]
Findings
• Larger β decreases the number of unsuitable data.
→ We conjecture that DRL enables us to consider the inequality constraints.
• Larger latent dimensionality d_v increases the number of unsuitable data.
→ We conjecture that samples far from the origin of the latent space tend to be unsuitable, because fine-grained features are emphasized in the area far from the origin.
Experiment 1-2: Qualitative evaluation of hyperparameter effects
25/32
Hyperparameters: d_v = 2, β ∈ {0.1, 0.5, 1.0}
Procedure (× 3, once per hyperparameter combination):
1. Hyperparameter value selection
2. β-VAE learning
3. Visualization: sample at equal intervals along the axes in the latent space (n = 15), decode each z to x with the decoder (DNN), and check whether the disentangled features satisfy the constraints; this reveals the meaning of each disentangled feature. (For comparison, data sampled at equal intervals along the axes in the original space, n = 15, are visualized as well.)
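A sketch of the axis traversal used for this visualization (the traversal range span is an assumption; the slides specify only 15 equally spaced points per axis):

```python
import numpy as np

def axis_traversal(decoder, d_v=2, n=15, span=3.0):
    """Experiment 1-2 visualization: decode n equally spaced points along
    each latent axis (the other coordinates held at 0) to see which
    feature of the weighing profile each disentangled axis controls."""
    profiles = []
    for axis in range(d_v):
        z = np.zeros((n, d_v))
        z[:, axis] = np.linspace(-span, span, n)  # equal intervals on one axis
        profiles.append(decoder(z))
    return profiles  # one array of decoded parameter vectors per axis
```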
Experiment 1-2: Qualitative evaluation of hyperparameter effects
Result
26/32
[Figure: decoded weighing profiles (valve opening degree vs. switching weight) along each latent axis for the three β values; one axis changes the initial point, the other changes both the initial point and the gradient.]
• Large β: sufficient consideration of the constraints, but poor diversity.
• Small β: lack of consideration of the constraints, but rich diversity.
Experiment 1-2: Qualitative evaluation of hyperparameter effects
Result (continued)
27/32
We used the rich-diversity setting (small β) in the next experiment, Experiment 2.
Experiment 1: Discussion
28/32
• Can DRL consider inequality constraints?
➢ Yes.
• How should we set the hyperparameter values?
➢ Visualizing the effects quantitatively (Experiment 1-1) and qualitatively (Experiment 1-2) is helpful for determining the values.
➢ We recommend determining the value of d_v first, because the suitable value of β depends on the value of d_v.
[Figure: map of acceptable hyperparameter settings. Toward smaller d_v and larger β, the reconstruction loss is too high (→ parameters have poor diversity); toward larger d_v and smaller β, the constraints are insufficiently considered; the acceptable parameter area lies in between.]
Experiment 2
30/32
Procedure
• We used three types of powder (A, B, and C) not included in the dataset; the PCA visualization of the fixed parameters shows that powders A, B, and C are not outliers.
• Based on the results of Experiment 1, we set d_v = 2 and β = 0.1, which leads to rich diversity and low reconstruction error.
Result
• The proposed method reduces the number of trials from about 20 (baseline: manual tuning) to around 5.
Features of the generated parameters
• For powders B and C, the constraints were satisfied in all trials.
• For powder A, unsuitable data were generated in one trial, presumably because the method explored areas far from the origin of the latent space (consistent with the observation in Experiment 1).
Limitations
31/32
• The relationship between the hyperparameters (d_v and β) and the number of required trials is still unclear: Experiment 1 linked the hyperparameters to constraint satisfaction, and Experiment 2 linked constraint satisfaction to the number of required trials, but the direct link remains open.
• The exploration area must be set manually.
Future work: finding the best size of the exploration area (bounding box) in the latent space. If it is too small, the optimal parameters cannot be explored; if it is too large, BO generates parameters that do not satisfy the constraints.
Conclusion
• We proposed methods to handle two types of constraints in Bayesian optimization even after nonlinear embedding.
➢ Known equality constraints: parameter decomposition is useful.
➢ Unknown inequality constraints: disentangled representation learning is useful.
• We conducted two experiments.
➢ Experiment 1 showed the effect of the hyperparameters on considering the inequality constraints, and the visualizations that help determine their values.
➢ Experiment 2 demonstrated that the proposed method reduces the number of trials by approximately 66% compared to manual tuning.
32/32
Do you have any questions?
33/32