Appropriate sampling of training points is one of the primary factors affecting the fidelity of surrogate models. This paper investigates the relative advantage of probability-based uniform sampling over distance-based uniform sampling in training surrogate models whose system inputs follow a distribution. Using the probability of the inputs as the metric for sampling, the probability-based uniform sample points are obtained by inverse transform sampling. To study the suitability of probability-based uniform sampling for surrogate modeling, the Mean Squared Error (MSE) of a monomial form is formulated based on the relationship between the squared error of a surrogate model and the volume or hypervolume per sample point. Two surrogate models are developed using the same number of probability-based and distance-based uniform sample points, respectively, to approximate the same system. Their fidelities are compared using the monomial MSE function. When the exponent of the monomial function is between 0 and 1, the fidelity of the surrogate model trained using probability-based uniform sampling is higher than that of the model trained using distance-based uniform sampling. When the exponent is greater than 1 or less than 0, the fidelity comparison is reversed. This theoretical conclusion is successfully verified using standard test functions and an engineering application.
1. Characterizing Probability-based Uniform Sampling for Surrogate Modeling
Junqiang Zhang, Souma Chowdhury, and Achille Messac
Department of Mechanical and Aerospace Engineering
Syracuse University
10th World Congress on Structural and Multidisciplinary Optimization
May 19-24, 2013, Orlando, Florida
2. Presentation Outline
• Background
• Research motivation
• Literature survey
• Probability-based and distance-based sampling
• Mean Squared Error (MSE) of a monomial form
• Comparison of fidelities using monomial MSE
• Testing the fidelity comparison conclusion
• Window performance evaluation
3. Background
• Surrogate modeling is a statistical approach used to develop
approximation functions that adequately represent the relationship
between inputs and outputs based on known data.
• Appropriate sampling of training points is one of the primary
factors affecting the fidelity of surrogate models.
4. Research Motivation
• The inputs to a system can be physical parameters whose probability of
occurrence is known or predefined.
• Between distance-based sampling and probability-based sampling,
which one should we select to generate the sample points of the inputs
to a system to develop its surrogate model with a higher fidelity?
[Figure: two panels of sample points in the (x(1), x(2)) plane, both axes from -20 to 20. Panel titles: Distance-based uniform; Probability-based uniform.]
5. Literature Survey
Techniques developed to improve the fidelity of surrogate modeling:
• Ensemble or hybrid form of different types of surrogate models
• Adaptive selection of sample points, e.g., Efficient Global Optimization (EGO)
Sampling: The selection of a subset of individuals from within a population to
estimate characteristics of the whole population.
• Sequence generation: random, pseudo-random, low-discrepancy, low-dispersion,
Latin hypercube
• Probability-based sampling: inverse transform sampling, acceptance and rejection
sampling, importance sampling, Markov Chain Monte Carlo methods (Markov
Chain Monte Carlo Random Walk: Metropolis-Hastings Sampling and Gibbs
Sampling)
6. Probability-based Sampling
• Inverse transform sampling evaluates the values of random variables corresponding
to their designated values of the Cumulative Distribution Function (CDF).
Evaluation of probability-based sample points:
• x(k) (k = 1, 2, . . . , n): the kth independent random variable.
• Fk(x(k)): the CDF of the variable x(k).
• ci(k) ∈ [0, 1] (i = 1, 2, . . . , m): a sequence designated as the values of the CDF of x(k).
$x_i^{(k),p} = F_k^{-1}\left(c_i^{(k)}\right)$
[Figure: a sequence in the (ci(1), ci(2)) unit square, axes from 0 to 1, is mapped by the inverse transform to sample points in the (x(1), x(2)) plane, axes from -20 to 20.]
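The inverse transform step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a single Gaussian variable (mean 0, standard deviation 5, both hypothetical values) and a hypothetical sequence of cell midpoints, chosen so that every CDF value stays strictly inside (0, 1), where the Gaussian inverse CDF is finite.

```python
from statistics import NormalDist

def probability_based_points(cdf_values, mean, std):
    """Map designated CDF values c_i in (0, 1) to points x_i = F^{-1}(c_i)."""
    dist = NormalDist(mu=mean, sigma=std)
    return [dist.inv_cdf(c) for c in cdf_values]

# Hypothetical sequence: m = 5 cell midpoints 0.1, 0.3, 0.5, 0.7, 0.9.
m = 5
c = [(i + 0.5) / m for i in range(m)]

# Assumed input distribution: Gaussian with mean 0 and standard deviation 5.
x_p = probability_based_points(c, mean=0.0, std=5.0)
print([round(v, 3) for v in x_p])
```

The resulting points cluster near the mean and spread out in the tails, which is the behaviour shown in the probability-based panel of the figure.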
7. Distance-based Sampling
Evaluation of distance-based sample points:
• x(k) (k = 1, 2, . . . , n): the kth independent random variable.
• xmin(k) and xmax(k): the lower and upper boundaries of x(k).
• ci(k) ∈ [0, 1] (i = 1, 2, . . . , m): a sequence designated as the values of
(x(k) − xmin(k)) / (xmax(k) − xmin(k)).
$x_i^{(k),d} = x_{\min}^{(k)} + c_i^{(k)}\left(x_{\max}^{(k)} - x_{\min}^{(k)}\right)$
[Figure: a sequence in the (ci(1), ci(2)) unit square, axes from 0 to 1, is scaled to sample points in the (x(1), x(2)) plane, axes from -20 to 20.]
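For comparison, the distance-based mapping is a linear scaling of the same kind of sequence. The sketch below assumes one variable bounded by -20 and 20 (the axis range of the figure) and a hypothetical five-level end-to-end sequence.

```python
def distance_based_points(sequence, x_min, x_max):
    """Scale sequence values c_i in [0, 1] linearly onto [x_min, x_max]."""
    return [x_min + c * (x_max - x_min) for c in sequence]

# Hypothetical end-to-end sequence with m = 5 levels: 0, 0.25, 0.5, 0.75, 1.
m = 5
c = [i / (m - 1) for i in range(m)]

x_d = distance_based_points(c, x_min=-20.0, x_max=20.0)
print(x_d)  # [-20.0, -10.0, 0.0, 10.0, 20.0]
```

Unlike the inverse transform, this mapping ignores the input distribution, so the points are equally spaced in x.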
8. Estimation of Mean Squared Error
• x: the input to the system under study.
• y: the actual output of the system.
• h(x): the surrogate model of the system.
• F(x): the CDF of input x.
• (xj , yj) where j = 1, 2, . . . , s : a set of test points
• Integrated estimation of MSE:
$MSE_I = \int \left(y - h(x)\right)^2 \, dF(x)$
• Empirical estimation of MSE:
$MSE_E = \frac{1}{s} \sum_{j=1}^{s} \left(y_j - h(x_j)\right)^2$
• The output y is usually only known at a limited number of test points. The
integrated estimation of MSE is not readily available.
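The empirical estimate is a plain average of squared residuals over the test points. The sketch below uses a hypothetical system y = x^2 and a deliberately crude hypothetical surrogate h(x) = x. The four test points are fixed by hand for illustration, whereas in the study they would follow the input distribution F(x).

```python
def empirical_mse(test_points, surrogate):
    """MSE_E = (1/s) * sum_j (y_j - h(x_j))^2 over s test points (x_j, y_j)."""
    s = len(test_points)
    return sum((y - surrogate(x)) ** 2 for x, y in test_points) / s

# Hypothetical system y = x^2 evaluated at four hand-picked test inputs.
test_points = [(x, x ** 2) for x in (0.0, 0.5, 1.0, 2.0)]

def h(x):
    """Hypothetical surrogate model h(x) = x."""
    return x

mse_e = empirical_mse(test_points, h)
print(mse_e)  # 1.015625
```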
9. Measure and Probability of a Region
Measure of a region: length, area, volume, or hypervolume in higher dimensional space (volume is used regardless of the number of dimensions).
Probability of a region: the probability that the input falls within the region.
10. Measures of Regions
• This study involves the relationship between the overall error of a surrogate model
and the sample density of its inputs.
• Sample density can be equivalently represented by (i) the measures of regions
associated with individual sample points, or (ii) the measures of regions confined
by sample points.
• Voronoi diagram: One approach to divide a sample space into regions associated
with individual points.
• For any point in the Voronoi region of ci , ci is its closest sample point measured by
the Euclidean distance.
[Figure: Voronoi diagrams of a 64-point Sukharev grid and a 64-point Sobol sequence, with a sample point ci and the Voronoi region of ci marked.]
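The measure of each Voronoi region can be approximated without constructing the diagram, by assigning random probe points to their nearest sample. The sketch below is a Monte Carlo illustration under assumptions not in the slides: a unit-square sample space, a hypothetical 4x4 grid of cell-centre points, and 20000 probes with a fixed seed.

```python
import random

def voronoi_measures(samples, n_probe=20000, seed=0):
    """Estimate the area of each sample's Voronoi region in the unit square.

    Each random probe is assigned to its nearest sample (Euclidean distance);
    the fraction of probes per sample approximates that region's area.
    """
    rng = random.Random(seed)
    counts = [0] * len(samples)
    for _ in range(n_probe):
        px, py = rng.random(), rng.random()
        nearest = min(
            range(len(samples)),
            key=lambda i: (samples[i][0] - px) ** 2 + (samples[i][1] - py) ** 2,
        )
        counts[nearest] += 1
    return [c / n_probe for c in counts]

# Hypothetical 4x4 grid of cell-centre points (a small Sukharev-style grid).
grid = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]
areas = voronoi_measures(grid)
print(min(areas), max(areas))
```

For this regular grid every estimate should lie close to 1/16. An irregular sequence would give a spread of region measures, which is the quantity the later MSE formulation depends on.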
11. Measures of Regions
• The Voronoi regions bordering the boundaries are viewed as half regions, and the
regions bordering the four vertices are viewed as quarter regions.
• The measures of regions associated with individual sample points and the
measures of regions confined by sample points are statistically equivalent.
[Figure: a Voronoi diagram (regions associated with individual sample points, with half points and half areas on boundaries and quarter points and quarter areas on vertices) shown beside the confined regions (regions confined by sample points); the two sets of measures are statistically equivalent.]
12. Measure and Probability
• For a region A = ∪Ai, if any two subregions Ai and Aj in A satisfy Ai ∩ Aj
= ∅ for i ≠ j, the measure of A is V = Σ Vi.
• The probability of A is F, and Fi is the probability of Ai, so F = Σ Fi.
• Distance-based uniform sampling: every confined region has the same measure,
$V_i^{(d)} = \frac{V}{m}$
• Probability-based uniform sampling: every confined region has the same probability,
$F_i^{(p)} = \frac{F}{m}$
[Figure: probability-based and distance-based full factorial sampling on the unit square, axes from 0 to 1, marking the measure of a confined region and the probability of a confined region.]
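The two uniformity conditions can be checked numerically in one dimension. The sketch below assumes a Gaussian input (mean 0, standard deviation 5), a region A = [-10, 10], and m = 4 cells. All of these values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=5.0)   # assumed input distribution
lo, hi, m = -10.0, 10.0, 4             # assumed region A = [lo, hi] and m cells
F = dist.cdf(hi) - dist.cdf(lo)        # probability of A
V = hi - lo                            # measure (length) of A

# Distance-based uniform: equal-length cells, V_i = V / m.
d_edges = [lo + i * V / m for i in range(m + 1)]

# Probability-based uniform: equal-probability cells, F_i = F / m.
p_edges = ([lo]
           + [dist.inv_cdf(dist.cdf(lo) + i * F / m) for i in range(1, m)]
           + [hi])

def cell_stats(edges):
    """Return (length, probability) for each cell between consecutive edges."""
    return [(b - a, dist.cdf(b) - dist.cdf(a)) for a, b in zip(edges, edges[1:])]

for name, edges in (("distance", d_edges), ("probability", p_edges)):
    print(name, [(round(v, 3), round(f, 3)) for v, f in cell_stats(edges)])
```

The distance-based cells share one length but carry different probabilities, while the probability-based cells share one probability but have different lengths. In both cases the lengths sum to V and the probabilities sum to F.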
13. Mean Squared Error of a Monomial Form
• When the number of training points for a surrogate model increases, the sample
density also increases, and the volume per sample point decreases.
• The error of a surrogate model is related to the volume per sample point.
• The MSE of the region Ai ⊂ A is statistically formulated as a monomial function of Vi:
$MSE(A_i) = a V_i^{\,l}, \quad a > 0$
• This monomial function shows the change of MSE with respect to the change of Vi.
Note: For different values of the measure Vi, the fitted exponent l and the fitted
parameter a can differ.
14. Mean Squared Error of the Entire Region A
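• In a sample region A with a probability distribution F(x), the integrated
estimation of MSE of a surrogate model in the subregion Ai is
$MSE(A_i) = \frac{\int_{A_i} \left(y - h(x)\right)^2 dF(x)}{\int_{A_i} dF(x)} = \frac{1}{F_i} \int_{A_i} \left(y - h(x)\right)^2 dF(x)$
• Therefore,
$MSE(A_i)\, F_i = \int_{A_i} \left(y - h(x)\right)^2 dF(x)$
• The MSE of the entire region A is the sum of MSE(Ai) Fi, i = 1, 2, . . . , m:
$MSE = \int_{A} \left(y - h(x)\right)^2 dF(x) = \sum_{i=1}^{m} \int_{A_i} \left(y - h(x)\right)^2 dF(x) = \sum_{i=1}^{m} MSE(A_i)\, F_i$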
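The decomposition can be checked with an empirical analogue, in which test points stand in for the integral, the fraction of points in Ai stands in for Fi, and the mean squared error within Ai stands in for MSE(Ai). The system, the surrogate, the standard Gaussian input, and the partition at -1, 0, 1 are all hypothetical choices for this sketch.

```python
import random

rng = random.Random(1)

def system(x):
    """Hypothetical system output y = x^3."""
    return x ** 3

def surrogate(x):
    """Hypothetical surrogate model h(x) = x."""
    return x

# Test points drawn from the assumed input distribution (standard Gaussian).
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
sq_err = [(system(x) - surrogate(x)) ** 2 for x in xs]
mse_total = sum(sq_err) / len(xs)

def region(x):
    """Index of the subregion A_i containing x, for edges at -1, 0 and 1."""
    return sum(x >= edge for edge in (-1.0, 0.0, 1.0))

weighted_sum = 0.0
for i in range(4):
    errs = [e for x, e in zip(xs, sq_err) if region(x) == i]
    f_i = len(errs) / len(xs)        # empirical probability of A_i
    mse_i = sum(errs) / len(errs)    # empirical MSE(A_i)
    weighted_sum += mse_i * f_i

print(mse_total, weighted_sum)
```

Up to floating-point rounding the two printed values agree, because weighting each subregion's MSE by its probability restores the overall average.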
15. Fidelity Comparison
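• Probability-based uniform sampling (Fi = F/m):
$MSE^{(p)} = \sum_{i=1}^{m} a \left(V_i^{(p)}\right)^{l} \frac{F}{m} = aF \cdot \frac{1}{m} \sum_{i=1}^{m} \left(V_i^{(p)}\right)^{l}$
• Distance-based uniform sampling (Vi = V/m):
$MSE^{(d)} = \sum_{i=1}^{m} a \left(\frac{V}{m}\right)^{l} F_i = aF \left(\frac{V}{m}\right)^{l}$
• Power mean:
$M_q(b_1, \ldots, b_m) = \left(\frac{1}{m} \sum_{i=1}^{m} b_i^{\,q}\right)^{1/q}$
q: a nonzero real number.
bi (i = 1, 2, . . . , m): positive real numbers.
• Power mean inequality: If q1 < q2, then Mq1(b1, . . . , bm) ≤ Mq2(b1, . . . , bm). The two means
are equal if and only if b1 = . . . = bi = . . . = bm.
• For l ≠ 0, the two errors are power means of the same measures, because Σ Vi(p) = V:
$\left(\frac{MSE^{(p)}}{aF}\right)^{1/l} = M_l\left(V_1^{(p)}, \ldots, V_m^{(p)}\right), \qquad \left(\frac{MSE^{(d)}}{aF}\right)^{1/l} = \frac{V}{m} = \frac{1}{m} \sum_{i=1}^{m} V_i^{(p)} = M_1\left(V_1^{(p)}, \ldots, V_m^{(p)}\right)$
Comparison of probability-based uniform and distance-based uniform sampling:
(1) l = 0: $MSE^{(p)} = aF = MSE^{(d)}$
(2) l = 1: $MSE^{(p)} = \frac{aF}{m} \sum_{i=1}^{m} V_i^{(p)} = \frac{aFV}{m} = MSE^{(d)}$
(3) 0 < l < 1: $\left(\frac{MSE^{(p)}}{aF}\right)^{1/l} \le \left(\frac{MSE^{(d)}}{aF}\right)^{1/l}$, so $MSE^{(p)} \le MSE^{(d)}$
(4) l > 1: $\left(\frac{MSE^{(p)}}{aF}\right)^{1/l} \ge \left(\frac{MSE^{(d)}}{aF}\right)^{1/l}$, so $MSE^{(p)} \ge MSE^{(d)}$
(5) l < 0: $\left(\frac{MSE^{(p)}}{aF}\right)^{1/l} \le \left(\frac{MSE^{(d)}}{aF}\right)^{1/l}$, and raising both sides to the negative power l reverses the inequality, so $MSE^{(p)} \ge MSE^{(d)}$
• For (3), (4), and (5), MSE(p) = MSE(d) if and only if V1(p) = . . . = Vi(p) = . . . = Vm(p) = V/m.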
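The five cases can be verified numerically from the two closed forms, MSE(p) = aF (1/m) Σ Vi^l and MSE(d) = aF (V/m)^l. The sketch below uses hypothetical unequal cell measures and sets a = F = 1. It only illustrates the inequality and does not reproduce any result from the study.

```python
def mse_probability_based(volumes, l, a=1.0, F=1.0):
    """MSE(p) = a*F * (1/m) * sum_i V_i^l, since every cell has F_i = F/m."""
    m = len(volumes)
    return a * F * sum(v ** l for v in volumes) / m

def mse_distance_based(volumes, l, a=1.0, F=1.0):
    """MSE(d) = a*F * (V/m)^l, since every cell has V_i = V/m."""
    m = len(volumes)
    return a * F * (sum(volumes) / m) ** l

# Hypothetical unequal cell measures V_i^(p) from probability-based sampling.
volumes = [0.05, 0.10, 0.15, 0.70]

for l in (-0.5, 0.0, 0.5, 1.0, 2.0):
    p = mse_probability_based(volumes, l)
    d = mse_distance_based(volumes, l)
    print(f"l = {l:4}: MSE(p) = {p:.4f}, MSE(d) = {d:.4f}")
```

With these measures the probability-based error is smaller only for l = 0.5, equal for l = 0 and l = 1, and larger for l = -0.5 and l = 2, matching cases (1) to (5).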
16. Fitting Monomial MSE
t: an odd number, greater than 2.
m1, m2, …, m(t+1)/2, …, mt: increasing numbers of sample points.
• The exponent l and the parameter a can be fitted using
$MSE_1^{(d)} = aF \left(\frac{V}{m_1}\right)^{l}$
⋮
$MSE_{(t+1)/2}^{(d)} = aF \left(\frac{V}{m_{(t+1)/2}}\right)^{l}$
⋮
$MSE_t^{(d)} = aF \left(\frac{V}{m_t}\right)^{l}$
• Fit the exponent l and the parameter a for m(t+1)/2 using regression.
[Figure: MSE and fitted exponent plotted against the number of sample points for distance-based uniform sampling, with m1, m(t+1)/2, and mt marked on the horizontal axis.]
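One way to carry out this regression is ordinary least squares on the logarithm of the model, ln MSE = ln(aF) + l ln(V/m). The slides do not state which regression was used, so the log-log form here is an assumption, as are the sample sizes and the synthetic MSE values generated with l = 0.8 and a = 2.

```python
import math

def fit_monomial_mse(ms, mses, V=1.0, F=1.0):
    """Least-squares fit of ln(MSE) = ln(a*F) + l*ln(V/m); returns (l, a)."""
    xs = [math.log(V / m) for m in ms]
    ys = [math.log(e) for e in mses]
    t = len(xs)
    x_bar, y_bar = sum(xs) / t, sum(ys) / t
    l = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = math.exp(y_bar - l * x_bar) / F
    return l, a

# Synthetic check: t = 5 increasing sample sizes, MSE built from l = 0.8, a = 2.
ms = [27, 64, 125, 216, 343]
mses = [2.0 * (1.0 / m) ** 0.8 for m in ms]

l_fit, a_fit = fit_monomial_mse(ms, mses)
print(round(l_fit, 6), round(a_fit, 6))  # recovers l close to 0.8 and a close to 2
```

Sliding the window of t sample sizes along the series would give one fitted exponent per central size m(t+1)/2, which is how the exponent can be tracked as the number of points grows.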
17. Testing the Fidelity Comparison Conclusion
• For each test function, using the same approach (either RBF or
Kriging), two surrogate models are developed using distance-based
and probability-based uniform sampling.
• The full factorial sampling from end to end is used to generate
sequences between 0 and 1.
• The probability distributions of all the variables are assumed to be
independent Gaussian distributions.
18. Testing the Fidelity Comparison Conclusion
• Test function 1:
• Test points: Gaussian distribution: Mean is 0.5. Standard deviation is 0.15.
Exponent l > 1
Fidelity: probability-based lower than distance-based
19. Testing the Fidelity Comparison Conclusion
• Test function 2: Booth function in two-dimensional space
• Test points: Bivariate Gaussian. Means are 0. Standard deviations are 3.5.
Kriging: exponent l ≈ 0
Fidelity: almost the same, almost zero
RBF: exponent l > 1
Fidelity: probability-based lower than distance-based
20. Testing the Fidelity Comparison Conclusion
• Test function 3: Hartmann function in three-dimensional space
• Test points: Trivariate Gaussian distribution. All three means are 0.5. All
three standard deviations are 0.15.
Left: exponent 0 < l < 1
Fidelity: probability-based slightly higher than distance-based
Right: exponent l > 1
Fidelity: probability-based lower than distance-based
21. Window Performance Evaluation
• A CFD model of a passive triple-pane window is used to evaluate its rates of
heat transfer at sample points under climatic conditions in January and August.
• Sampling, for January or August:
Sobol sampling: probability-based and distance-based
Full factorial sampling: probability-based and distance-based
• For either January or August, four series of surrogate models are developed
for different sampling approaches with increasing numbers of points.
• In this application, the exponent l is constrained to be greater than or equal to 0.
22. Sobol Sequence Used for January
(1) Generally, when the exponent is between 0 and 1, the RMSE for
probability-based sampling is less than that for distance-based sampling.
(2) Generally, when the exponent is greater than 1, the RMSE for probability-based
sampling is greater than that for distance-based sampling.
23. Sobol Sequence Used for August
(1) The exponent is generally between 0 and 1, and the values of RMSE for
probability-based sampling are less than those for distance-based sampling.
(2) Once the number of sample points is sufficiently large, RMSE
decreases very slowly and the exponent is approximately 1.
24. Full Factorial Sampling Used for January
[Figure: two panels with horizontal-axis ticks 2^3, 3^3, ..., 9^3 (number of sample points).]
(1) The RMSE values of the two series of surrogate models are very close.
(2) When the number of points increases from 4^3 to 6^3, the value of the
exponent of the monomial MSE function drops from around 1 to slightly above 0.
(3) When the exponent is close to 0 or 1, the RMSE values for the two types of
sampling are very close.
25. Full Factorial Sampling Used for August
[Figure: two panels with horizontal-axis ticks 2^3, 3^3, ..., 9^3 (number of sample points).]
The values of the exponent are between 0 and 1. The RMSE of the surrogate
models developed using probability-based sampling is less than that of the
models developed using distance-based sampling.
26. Concluding Remarks
1. The MSE of a monomial form is formulated based on the relationship
between the MSE of a surrogate model and the volume or hypervolume
per sample point.
2. When the exponent of the monomial MSE function is between 0 and 1,
the fidelity of the surrogate model trained using probability-based
uniform sampling is higher than that of the other one trained using
distance-based uniform sampling. When the value of the exponent is
greater than 1, the fidelity comparison is reversed.
3. Interestingly, for the same problem, either distance-based or
probability-based sampling can be the more suitable choice, depending on
the scenario and on the surrogate used.