1. US Army Corps of Engineers
BUILDING STRONG®
Choosing a Probability
Distribution
Charles Yoe
Professor of Economics, Notre Dame of
Maryland University
Huntington District
March, 2015
2. BUILDING STRONG®
Learning Objectives
At the end of this session participants
will be able to:
Systematically identify the best
distribution to use for a single uncertain
variable
4. BUILDING STRONG®
Using Probability
Sometimes you will estimate the
probability of an event
Sometimes you will use distributions to
►Describe data
►Model variability (range of consequences)
►Represent your uncertainty
Which distribution
should I use?
5. BUILDING STRONG®
Basic Distinctions
Constant
Variables
• Some things vary predictably
• Some things vary unpredictably
Random variables
• It can be something known but not
known by us
6. BUILDING STRONG®
A probability distribution is
simply a list or picture of all
possible outcomes and their
corresponding probabilities
7. BUILDING STRONG®
When the outcomes are
continuous, like lockage
times, then the notion of
probability takes on some
subtleties.
8. BUILDING STRONG®
Probability Mass Function
When the sample space consists
of discrete outcomes, such as the number
of Tainter gate failures this year,
we can talk about the probability
of each outcome.
9. BUILDING STRONG®
Probability Density Function
For continuous outcome
spaces, we can “discretize”
the space into a finite set
of mutually exclusive and
exhaustive “bins”. We can
divide lockage times into
intervals: 0-50, >50-100,
>100-103.7 and so on.
11. BUILDING STRONG®
This measures the x variable value
This measures p(x); for a discrete variable
this is the probability of the corresponding x
12. BUILDING STRONG®
This measures the x variable value
This measures p(x); for a continuous variable
this is the density, not the probability, of x
13. BUILDING STRONG®
Notice the vertical axes of the two normal distributions, each with a mean of 100.
When the SD is small (right), many more values must get ‘packed’ into a
smaller interval, making the data much more dense: a peak density of about 4 vs. 0.4.
Thus, the vertical axis for a continuous distribution has
no meaning as a probability. It measures the density of
the data.
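A minimal sketch of why the peak height changes with the standard deviation. The slide gives only the mean (100); the standard deviations of 1 and 0.1 used here are assumptions chosen because they reproduce peaks of roughly 0.4 and 4.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x (a density, not a probability)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Assumed standard deviations; the slide states only the mean of 100.
peak_wide = normal_pdf(100, mean=100, sd=1.0)
peak_narrow = normal_pdf(100, mean=100, sd=0.1)

print(round(peak_wide, 2), round(peak_narrow, 2))
```

The narrow curve peaks well above 1, which a probability could never do. The area under each curve is still 1.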
15. BUILDING STRONG®
Checklist for Choosing a Distribution
From Some Data
1. Can you use your data?
2. Understand your variable
a) Source of data
b) Continuous/discrete
c) Bounded/unbounded
d) Meaningful parameters
• Do you know them? (1st or 2nd order)
e) Univariate/multivariate
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis
16. BUILDING STRONG®
First!
Do you have data germane to your
population?
If so, do you need a distribution or can
you just use your data?
17. BUILDING STRONG®
One problem could be
bounding your data, if
you do not have the true
minimum and maximum
values.
Any dataset can be displayed as
a cumulative distribution function
or a general density function
20. BUILDING STRONG®
Fitting Empirical Distribution to Data
Rank data x(i) in
ascending order
Calculate the percentile
for each value
Use data and
percentiles to create
cumulative distribution
function
Index  Data Value  Cumulative Probability F(x) = i/19
0      0           0
1      0.9         0.053
2      3.6         0.105
3      5.0         0.158
4      6.0         0.211
5      11.7        0.263
6      16.2        0.316
7      16.5        0.368
8      22.2        0.421
9      22.7        0.474
10     23.2        0.526
11     24.5        0.579
12     24.9        0.632
13     25.8        0.684
14     33.3        0.737
15     33.4        0.789
16     34.7        0.842
17     40.2        0.895
18     44.2        0.947
19     60.0        1
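The ranking and percentile steps can be sketched as follows, using the values from the table. The function names are hypothetical. Linear interpolation between tabulated points is an assumption about how the cumulative curve is read, not something the slide specifies.

```python
from bisect import bisect_right

# Values from the table, already in ascending order (index 0 carries F(x) = 0).
values = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
          23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

def build_table(sorted_values):
    """Return (index, value, F(x)) rows with F(x) = i / n, here n = 19."""
    n = len(sorted_values) - 1
    return [(i, x, i / n) for i, x in enumerate(sorted_values)]

def empirical_cdf(x, table):
    """Cumulative probability at x, interpolating linearly between rows."""
    xs = [row[1] for row in table]
    ps = [row[2] for row in table]
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    j = bisect_right(xs, x)  # xs[j - 1] <= x < xs[j]
    x0, x1, p0, p1 = xs[j - 1], xs[j], ps[j - 1], ps[j]
    return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

table = build_table(values)
for i, x, p in table[:4]:
    print(i, x, round(p, 3))
```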
24. BUILDING STRONG®
Understand Your Data
►Experiments
►Observation
►Surveys
►Computer databases
►Literature searches
►Simulations
►Test case
Where did my
data come
from?
25. BUILDING STRONG®
►Discrete variables take one
of a set of identifiable values;
you can calculate the
probability of each value
►Continuous variables can take any
value within a defined range;
you can’t calculate the
probability of a single value,
only of an interval of values
Is my variable
discrete or
continuous?
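A short sketch of the distinction. Both models and all their parameters are hypothetical illustrations: a Poisson count of gate failures with a rate of 0.5 per year, and a normal lockage time with mean 60 and standard deviation 10.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, rate):
    """Probability of exactly k events for a Poisson count."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

# Discrete (hypothetical): probability of exactly one gate failure this year.
p_one_failure = poisson_pmf(1, rate=0.5)

# Continuous (hypothetical): lockage time in minutes.
lockage = NormalDist(mu=60, sigma=10)
density_at_60 = lockage.pdf(60)                  # a density, not a probability
p_interval = lockage.cdf(70) - lockage.cdf(50)   # probability of an interval

print(round(p_one_failure, 4), round(density_at_60, 4), round(p_interval, 4))
```

For the continuous variable, only the interval calculation yields a probability.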
26. BUILDING STRONG®
Barges in a tow
Houses in floodplain
People at a meeting
Results of a diagnostic test
Casualties per year
Relocations and acquisitions
Average number of barges per tow
Weight of an adult striped bass
Sensitivity or specificity of a diagnostic test
Transit time
Expected annual damages
Duration of a storm
Shoreline eroded
Sediment loads
28. BUILDING STRONG®
What Values Are Possible?
►Bounded: value confined to
lie between two determined
values
►Unbounded: value
theoretically extends from
minus infinity to plus infinity
►Partially bounded:
constrained at one end
(truncated distributions)
Is my variable
bounded or
unbounded?
29. BUILDING STRONG®
Continuous Distribution
Examples
Unbounded
► Normal
► t
► Logistic
Left Bounded
► Chi-square
► Exponential
► Gamma
► Lognormal
► Weibull
Bounded
► Beta
► Cumulative
► General/histogram
► Pert
► Uniform
► Triangle
31. BUILDING STRONG®
Are There Parameters
►Parametric: shape is
determined by the mathematics
of a conceptual probability
model
►Non-parametric: empirical
distributions whose
mathematics is defined by
the shape required
Does my
variable have
meaningful
parameters?
32. BUILDING STRONG®
Parametric and Non-Parametric
Normal
Lognormal
Exponential
Poisson
Binomial
Gamma
Uniform
Pert
Triangular
Cumulative
33. BUILDING STRONG®
Do You Know the Parameters?
1st order distribution-
known parameters
►Risknormal(100,10)
2nd order distribution-
uncertain parameters
► Risknormal(risktriang(90,100,103),riskuniform(8,11))
Do I know the
parameters?
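A rough Python analogue of the two spreadsheet formulas, assuming the triangular arguments are read as minimum, most likely, and maximum. Note that Python's `random.triangular` takes its arguments as low, high, mode. The function names and the seed are hypothetical.

```python
import random
import statistics

def first_order(rng):
    """Known parameters: Normal(mean=100, sd=10)."""
    return rng.gauss(100, 10)

def second_order(rng):
    """Uncertain parameters: draw the mean and sd first, then the value."""
    mean = rng.triangular(90, 103, 100)  # low, high, mode
    sd = rng.uniform(8, 11)
    return rng.gauss(mean, sd)

rng = random.Random(2015)  # arbitrary seed so the run is repeatable
first = [first_order(rng) for _ in range(20000)]
second = [second_order(rng) for _ in range(20000)]

print(round(statistics.fmean(first), 1), round(statistics.fmean(second), 1))
```

Each second-order draw resamples the parameters, so the output reflects both the variability of the quantity and the uncertainty about its parameters.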
34. BUILDING STRONG®
Choose a parametric distribution
under these circumstances
Theory supports choice
Distribution proven accurate
for modelling your specific
variable
Distribution matches observed
data well
Need distribution with tail
extending beyond the
observed minimum or
maximum
35. BUILDING STRONG®
Choose a nonparametric
distribution under these
circumstances
Theory is lacking
There is no commonly used
model
Data are severely limited
Knowledge is limited to general
beliefs and some evidence
36. BUILDING STRONG®
Is It Dependent on Other
Variables?
►Univariate: not
probabilistically linked
to any other variable in
the model
►Multivariate: variables are
probabilistically linked
in some way
►Engineering
relationships are often
multivariate
Do the values of
my variable depend
on the values of
other variables?
37. BUILDING STRONG®
Distribution            Type        No Bounds  Bounded Left & Right  Left Bound Only  Category        Shape             Empirical Distribution
Beta                    Continuous  No         Yes                   No               Non-parametric  Shape shifter     No
Binomial                Discrete    No         Yes                   No               Parametric      Some Flexibility  No
Chi Squared             Continuous  No         No                    Yes              Parametric      Basic Shape       No
Cumulative ascending    Continuous  No         Yes                   No               Non-parametric  Shape shifter     Yes
Cumulative descending   Continuous  No         Yes                   No               Non-parametric  Shape shifter     Yes
Discrete                Discrete    No         Yes                   No               Non-parametric  Shape shifter     Yes
Discrete uniform        Discrete    No         Yes                   No               Non-parametric  Basic Shape       No
Erlang                  Continuous  No         No                    Yes              Parametric      Basic Shape       No
Error                   Continuous  Yes        No                    No               Parametric      Some Flexibility  No
Exponential             Continuous  No         No                    Yes              Parametric      Basic Shape       No
Extreme value           Continuous  No         No                    Yes              Parametric      Basic Shape       No
Gamma                   Continuous  No         No                    Yes              Parametric      Shape shifter     No
General                 Continuous  No         Yes                   No               Non-parametric  Shape shifter     Yes
Geometric               Discrete    No         No                    Yes              Parametric      Some Flexibility  No
Histogram               Continuous  No         Yes                   No               Non-parametric  Shape shifter     Yes
Hypergeometric          Discrete    No         Yes                   No               Parametric      Some Flexibility  No
Integer uniform         Discrete    No         Yes                   No               Non-parametric  Basic Shape       No
Inverse Gaussian        Continuous  No         No                    Yes              Parametric      Basic Shape       No
Logarithmic             Discrete    No         No                    Yes              Parametric      Some Flexibility  No
Logistic                Continuous  Yes        No                    No               Parametric      Basic Shape       No
Lognormal               Continuous  No         No                    Yes              Parametric      Basic Shape       No
Lognormal2              Continuous  No         No                    Yes              Parametric      Basic Shape       No
Negative Binomial       Discrete    No         No                    Yes              Parametric      Some Flexibility  No
Normal                  Continuous  Yes        No                    No               Parametric      Basic Shape       No
Pareto                  Continuous  No         No                    Yes              Parametric      Basic Shape       No
Pareto2                 Continuous  No         No                    Yes              Parametric      Basic Shape       No
Pearson V               Continuous  No         No                    Yes              Parametric      Some Flexibility  No
Pearson VI              Continuous  No         No                    Yes              Parametric      Some Flexibility  No
PERT                    Continuous  No         Yes                   No               Non-parametric  Shape shifter     No
Poisson                 Discrete    No         No                    Yes              Parametric      Basic Shape       No
Rayleigh                Continuous  No         No                    Yes              Parametric      Basic Shape       No
Student                 Continuous  Yes        No                    No               Parametric      Basic Shape       No
Triangle (various)      Continuous  No         Yes                   No               Non-parametric  Some Flexibility  No
Weibull                 Continuous  No         No                    Yes              Parametric      Shape shifter     No
How can I make
this table
legible?
39. BUILDING STRONG®
Look for
distinctive
shapes and
features of your
data.
►Single peaks
►Symmetry
►Positive skew
►Negative values
Gamma,
Weibull, and
beta are useful
and flexible
functions.
40. BUILDING STRONG®
►Low coefficient of variation
& mean = median => Normal
►Positive skew & mean =
standard deviation =>
Exponential
►Consider outliers
Try calculating some
statistics to get a feel.
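The two rules of thumb can be turned into a quick screening function. This is only a sketch: the 10% tolerance, the function names, and both sample datasets are assumptions made for illustration. The coefficient of variation also assumes a positive mean.

```python
import math
import statistics

def describe(data):
    """Summary statistics used by the rules of thumb."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    spread = statistics.pstdev(data)
    skew = sum((x - mean) ** 3 for x in data) / (len(data) * spread ** 3)
    return {"mean": mean, "median": statistics.median(data),
            "sd": sd, "cv": sd / mean, "skew": skew}

def first_guess(data, tol=0.1):
    """Hypothetical screen; tol is an arbitrary illustrative tolerance."""
    s = describe(data)
    if s["skew"] > 0 and abs(s["mean"] - s["sd"]) <= tol * s["mean"]:
        return "exponential"
    if s["cv"] <= tol and abs(s["mean"] - s["median"]) <= tol * s["sd"]:
        return "normal"
    return "no quick guess"

symmetric = [98, 99, 100, 101, 102]
# Evenly spaced quantiles of an exponential distribution with mean 1.
skewed = [-math.log(1 - (i + 0.5) / 1000) for i in range(1000)]

print(first_guess(symmetric), first_guess(skewed))
```

A result from a screen like this is a starting point for the later checklist steps, not a choice of distribution.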
41. BUILDING STRONG®
Formal theory, e.g., CLT
Theoretical knowledge of
the variable
►Behavior or math
Informal theory
►Sums normal,
products lognormal
►Study specific
►Your best documented
thoughts on subject
Theory is your most
compelling reason for
choosing a distribution.
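The "sums normal, products lognormal" rule can be checked with a small simulation. The uniform factors on 0.5 to 1.5, the 30 terms, the 5000 trials, and the seed are all arbitrary assumptions. Skewness near zero is used here as a rough stand-in for "looks normal".

```python
import math
import random
import statistics

def skewness(data):
    """Population skewness: zero for a symmetric distribution."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * s ** 3)

rng = random.Random(1)  # arbitrary seed for repeatability
trials, terms = 5000, 30
sums, products = [], []
for _ in range(trials):
    factors = [rng.uniform(0.5, 1.5) for _ in range(terms)]
    sums.append(sum(factors))
    products.append(math.prod(factors))

skew_sums = skewness(sums)                                       # near zero
skew_products = skewness(products)                               # strongly positive
skew_log_products = skewness([math.log(p) for p in products])    # near zero

print(round(skew_sums, 2), round(skew_products, 2), round(skew_log_products, 2))
```

The logarithm of a product is a sum of logarithms, so the log of the product is roughly symmetric even though the product itself is strongly right-skewed.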
42. BUILDING STRONG®
Outliers
[Figure: House Value, axis from 0 to 600]
Extreme observations can
drastically influence a
probability model
What are
these points
telling you?
► What about your
world-view is
inconsistent with
this result?
► Should you
reconsider your
perspective?
► What possible
explanations
have you not yet
considered?
43. BUILDING STRONG®
If an observation is an
error, remove it.
Your explanation must
be correct, not just
plausible.
If you must keep it and
can’t explain it, just live
with the results.
45. BUILDING STRONG®
Goodness of Fit
Provides statistical evidence to test
hypothesis that your data could have
come from a specific distribution
H0: these data come from an “x”
distribution
A small test statistic and a large p-value mean
you fail to reject H0
It is another piece of evidence not a
determining factor
Geek slide
warning
47. BUILDING STRONG®
GOF Tests
Akaike Information
Criterion (AIC)
Bayesian Information
Criterion (BIC)
Chi-Square Test
Kolmogorov-Smirnov
Test
Anderson-Darling Test
Geek slide
warning
Better fit for means
than tails
Better fit at extreme
tails of distribution
Better fit for means
than tails
Use AIC or BIC
unless you have
reason not to
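For reference, the two information criteria are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of fitted parameters, n the sample size, and L the maximized likelihood. Lower values indicate a better trade-off between fit and complexity. The sketch below compares a normal and an exponential fit. The sample values are borrowed from the empirical-distribution table earlier in the deck purely for illustration, and the function names are hypothetical.

```python
import math
import statistics

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

def normal_loglik(data):
    """Log-likelihood at the maximum-likelihood mean and sd."""
    sigma = statistics.pstdev(data)
    return -0.5 * len(data) * (math.log(2 * math.pi * sigma ** 2) + 1)

def exponential_loglik(data):
    """Log-likelihood at the maximum-likelihood rate, 1 / mean."""
    return -len(data) * (math.log(statistics.fmean(data)) + 1)

# Illustrative sample only (observed values from the earlier table).
data = [0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7, 23.2,
        24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

fits = {"normal": (normal_loglik(data), 2), "exponential": (exponential_loglik(data), 1)}
scores = {name: {"AIC": aic(ll, k), "BIC": bic(ll, k, len(data))}
          for name, (ll, k) in fits.items()}
best_by_aic = min(scores, key=lambda name: scores[name]["AIC"])

for name, s in scores.items():
    print(name, round(s["AIC"], 1), round(s["BIC"], 1))
```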
48. BUILDING STRONG®
►Data never collected
►Data too expensive or
impossible
►Past data irrelevant
►Opinion needed to fill
holes in sparse data
►New area of inquiry,
unique situation that
never existed
Why use
an expert?
49. BUILDING STRONG®
The distribution itself
►E.g. population is
normal
Parameters of the
distribution
►E.g. mean is x and
standard deviation is y
►E.g. 5th, 50th, 95th
percentile values
What might
an expert
estimate?
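When an expert supplies percentiles rather than parameters, the parameters can be backed out. The sketch below assumes the expert has already judged the quantity to be normal, so the 50th percentile is the mean and the 5th and 95th sit about 1.645 standard deviations either side. The percentile values and the function name are hypothetical.

```python
from statistics import NormalDist

def normal_from_percentiles(p5, p50, p95):
    """Normal distribution implied by elicited 5th, 50th, 95th percentiles."""
    z95 = NormalDist().inv_cdf(0.95)   # about 1.645
    sigma = (p95 - p5) / (2 * z95)
    return NormalDist(mu=p50, sigma=sigma)

# Hypothetical elicited values.
fitted = normal_from_percentiles(p5=83.55, p50=100.0, p95=116.45)
print(round(fitted.mean, 2), round(fitted.stdev, 2))
```

If the elicited 5th and 95th percentiles are not roughly equidistant from the median, that is a sign the normal assumption does not match the expert's belief.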
50. BUILDING STRONG®
Unsure which
distribution is best?
Try several
►If no difference you
are free to use any
one
►Significant
differences mean
doing more work
A final strategy is to test
the sensitivity of your
results to the uncertain
probability model.
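One way to run such a sensitivity test is to push each candidate distribution through the same model and compare the outputs. Everything in this sketch is a placeholder: the `model` function, the three candidate distributions, and their parameters are hypothetical and stand in for a real risk model.

```python
import random
import statistics

def model(x):
    """Hypothetical placeholder for the real risk model."""
    return 3.0 * x + 20.0

# Hypothetical candidates for the same uncertain input.
candidates = {
    "normal": lambda rng: rng.gauss(100, 10),
    "triangular": lambda rng: rng.triangular(75, 125, 100),  # low, high, mode
    "uniform": lambda rng: rng.uniform(83, 117),
}

def summarize(sampler, n=10000, seed=0):
    """Mean and 95th percentile of the model output for one candidate."""
    rng = random.Random(seed)
    out = sorted(model(sampler(rng)) for _ in range(n))
    return {"mean": statistics.fmean(out), "p95": out[int(0.95 * n)]}

results = {name: summarize(sampler) for name, sampler in candidates.items()}
for name, r in results.items():
    print(name, round(r["mean"], 1), round(r["p95"], 1))
```

If the decision-relevant outputs barely move across candidates, the choice among them matters little. If they differ materially, more work on the distribution is warranted.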
53. BUILDING STRONG®
Know Your Data
Data are continuous.
Practical minimum and maximum temperatures for these
coastal waters during the summer are indefinite. Fuzzy
bounds mean you can treat the quantity as unbounded over a
limited range of the number line.
Choose a normal (parametric) distribution because some
theory supports the choice, the distribution matches the observed data
well, and the distribution has a tail extending beyond the observed
minimum and maximum. That information will be revealed in
subsequent steps.
Parameter estimates mean a first-order distribution.
Univariate.
54. BUILDING STRONG®
Look at It
[Figure: Daily Maximum Water Temperature. Vertical axis: Percent (0 to 18); horizontal axis: Water Temperature (Degrees Celsius)]
55. BUILDING STRONG®
Look Again!
[Figure: Water Temperature Boxplot. Axis: Daily Maximum Water Temperature, 27.8 to 29.6]
The boxplot confirms a slight skew to the
left in the sample data and in the
interquartile range.
56. BUILDING STRONG®
Theory
The “system” that produces a daily
maximum temperature is complex
enough and random enough that we
believe deviations about the mean are
likely to be symmetrical.
57. BUILDING STRONG®
Statistics
Descriptive statistics: Daily Maximum Water Temperature
Count                            100
Mean                             28.8
Median                           28.8
Sample standard deviation        0.3
Minimum                          28.1
Maximum                          29.5
Range                            1.4
Standard error of the mean       0.027
95% confidence interval, lower   28.7
95% confidence interval, upper   28.9
Coefficient of variation         0.9
Mean and median are
approximately equal
Coefficient of variation (0
to 100+ scale) is small.
There are no outliers in
this dataset.
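The derived rows of the table follow from the count, mean, and standard deviation. The sketch below uses the rounded values as displayed, so it gives a standard error of 0.03 and a coefficient of variation of about 1.0 rather than the slide's 0.027 and 0.9. That gap presumably comes from the slide using unrounded inputs, which is an inference and not something the slide states. The 1.96 multiplier assumes a normal approximation for the confidence interval.

```python
import math

n = 100
mean = 28.8
sd = 0.3   # rounded value as displayed on the slide

standard_error = sd / math.sqrt(n)
cv_percent = 100 * sd / mean               # on the 0 to 100+ scale
ci_lower = mean - 1.96 * standard_error    # normal approximation (assumption)
ci_upper = mean + 1.96 * standard_error

print(round(standard_error, 3), round(cv_percent, 1),
      round(ci_lower, 1), round(ci_upper, 1))
```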
58. BUILDING STRONG®
GOF
Chi-Squared Test
                    Normal    Weibull   Logistic  ExtValue  Pareto
Chi-Sq Statistic    7.14      8.02      11.98     17.26     121.89
P-Value             0.7122    0.6269    0.2864    0.0688    0
Cr. Value @ 0.750   6.7372    6.7372    6.7372    6.7372    6.7372
Cr. Value @ 0.500   9.3418    9.3418    9.3418    9.3418    9.3418
Cr. Value @ 0.250   12.5489   12.5489   12.5489   12.5489   12.5489
Cr. Value @ 0.150   14.5339   14.5339   14.5339   14.5339   14.5339
Cr. Value @ 0.100   15.9872   15.9872   15.9872   15.9872   15.9872
Cr. Value @ 0.050   18.307    18.307    18.307    18.307    18.307
Cr. Value @ 0.025   20.4832   20.4832   20.4832   20.4832   20.4832
Cr. Value @ 0.010   23.2093   23.2093   23.2093   23.2093   23.2093
Cr. Value @ 0.005   25.1882   25.1882   25.1882   25.1882   25.1882
Cr. Value @ 0.001   29.5883   29.5883   29.5883   29.5883   29.5883
Chi-Square Test Results
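Reading the table amounts to comparing each chi-square statistic with a critical value. The sketch below uses the statistics and the 0.05 critical value from the table. The choice of 0.05 as the significance level is an assumption made for illustration.

```python
# Chi-square statistics and the 0.05 critical value, copied from the table.
chi_sq = {"Normal": 7.14, "Weibull": 8.02, "Logistic": 11.98,
          "ExtValue": 17.26, "Pareto": 121.89}
critical_05 = 18.307

# A statistic above the critical value rejects that distribution at the 0.05 level.
rejected = [name for name, stat in chi_sq.items() if stat > critical_05]
ranked = sorted(chi_sq, key=chi_sq.get)   # smallest statistic first

print("rejected:", rejected)
print("ranked:", ranked)
```

Only the Pareto fit is rejected at this level. The remaining four are all statistically consistent with the data, which is why the fit result is treated as one piece of evidence alongside theory and the plots.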
59. BUILDING STRONG®
Choice
An expert opinion is not necessary.
Sensitivity analysis will not be necessary.
The normal distribution is continuous,
parametric, consistent with my theory,
successfully used in the past, and
statistically consistent with my data.
Therefore I will use it.
60. BUILDING STRONG®
Take Away Points
Choosing the best distribution is where
most new risk assessors feel least
comfortable.
Choice of distribution matters.
Distributions come from data and
expert opinion.
Distribution fitting should never be the
basis for distribution choice.