Choosing a Probability Distribution - Charles Yoe

US Army Corps of Engineers
BUILDING STRONG®
Choosing a Probability
Distribution
Charles Yoe
Professor of Economics, Notre Dame of
Maryland University
Huntington District
March, 2015

Learning Objectives
 At the end of this session participants
will be able to:
 Systematically identify the best
distribution to use for a single uncertain
variable

Quantitative risk assessment
requires you to use probability

Using Probability
 Sometimes you will estimate the
probability of an event
 Sometimes you will use distributions to
►Describe data
►Model variability (range of consequences)
►Represent your uncertainty
Which distribution
should I use?

Basic Distinctions
 Constant
 Variables
• Some things vary predictably
• Some things vary unpredictably
 Random variables
• It can be something known but not
known by us

A probability distribution is
simply a list or picture of all
possible outcomes and their
corresponding probabilities

When the outcomes are
continuous, like lockage
times, then the notion of
probability takes on some
subtleties.

Probability Mass Function
When the sample space consists
of discrete outcomes, such as the number
of tainter gate failures this year,
we can talk about the probability
of each outcome.

Probability Density Function
For continuous outcome
spaces, we can “discretize”
the space into a finite set
of mutually exclusive and
exhaustive “bins”. We can
divide lockage times into
intervals: 0-50, >50-100,
>100-103.7 and so on.

Probability
Density
Function
Cumulative
Distribution
Function
Survival Function

This measures the x variable value
This measures p(x), for a discrete variable
this is the probability of a corresponding x

This measures the x variable value
This measures p(x), for a continuous variable
this is density and not the probability of x

Notice the vertical axes in two normal distributions with a mean of 100.
When the SD is small (right) many more values must get ‘packed’ into a
smaller interval, making the data much denser: a peak of about 4 vs. 0.4.
Thus, the vertical axis for a continuous distribution has
no meaning as a probability. It measures the density of
the data.
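A minimal sketch of the point above, using only the Python standard library. The standard deviations of 1 and 0.1 are assumptions chosen because they reproduce peak densities near 0.4 and 4; the slide states only the mean of 100. The interval 99.9 to 100.1 is likewise just an illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, sd):
    """Height of the normal density curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

MEAN = 100.0
SD_WIDE, SD_NARROW = 1.0, 0.1  # assumed values, not given on the slide

# Density at the mean: 1 / (sd * sqrt(2 * pi))
peak_wide = normal_pdf(MEAN, MEAN, SD_WIDE)      # roughly 0.4
peak_narrow = normal_pdf(MEAN, MEAN, SD_NARROW)  # roughly 4, so it cannot be a probability

def prob_between(lo, hi, sd):
    """Probability is the area under the density between lo and hi."""
    dist = NormalDist(MEAN, sd)
    return dist.cdf(hi) - dist.cdf(lo)

prob_wide = prob_between(99.9, 100.1, SD_WIDE)
prob_narrow = prob_between(99.9, 100.1, SD_NARROW)

print(f"peak density: {peak_wide:.3f} vs {peak_narrow:.3f}")
print(f"P(99.9 < X < 100.1): {prob_wide:.3f} vs {prob_narrow:.3f}")
```

The heights differ by a factor of ten, yet both interval probabilities stay between 0 and 1, because probability is read from the area and not from the vertical axis.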

Applying the Principles
 How many probability distributions can
we name without help?

Checklist for Choosing a Distribution
From Some Data
1. Can you use your data?
2. Understand your variable
a) Source of data
b) Continuous/discrete
c) Bounded/unbounded
d) Meaningful parameters
   • Do you know them? (1st or 2nd order)
e) Univariate/multivariate
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis

First!
 Do you have data germane to your
population?
 If so, do you need a distribution or can
you just use your data?

One problem could be
bounding your data, if
you do not have the true
minimum and maximum
values.
Any dataset can be displayed as
a cumulative distribution function
or a general density function

Display Either Way

 Go to @RISK
 What are advantages of PDF?
 CDF?

Fitting Empirical Distribution to Data
 Rank data x(i) in
ascending order
 Calculate the percentile
for each value
 Use data and
percentiles to create
cumulative distribution
function
Index  Data value  Cumulative probability F(x) = i/19
0 0 0
1 0.9 0.053
2 3.6 0.105
3 5.0 0.158
4 6.0 0.211
5 11.7 0.263
6 16.2 0.316
7 16.5 0.368
8 22.2 0.421
9 22.7 0.474
10 23.2 0.526
11 24.5 0.579
12 24.9 0.632
13 25.8 0.684
14 33.3 0.737
15 33.4 0.789
16 34.7 0.842
17 40.2 0.895
18 44.2 0.947
19 60.0 1
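The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the @RISK procedure: the 19 values are copied from the table, the minimum of 0 is the assumed lower bound shown at index 0, and the helper names `empirical_cdf` and `cdf_at` are hypothetical.

```python
data = [0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7, 23.2,
        24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

def empirical_cdf(values, minimum=0.0):
    """Rank the data and pair each value x(i) with F(x) = i / n."""
    ranked = sorted(values)
    n = len(ranked)
    points = [(minimum, 0.0)]  # assumed lower bound, index 0 in the table
    points += [(x, i / n) for i, x in enumerate(ranked, start=1)]
    return points

def cdf_at(points, x):
    """Cumulative probability at x by linear interpolation between points."""
    if x <= points[0][0]:
        return 0.0
    if x >= points[-1][0]:
        return 1.0
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if x1 > x0 and x0 <= x <= x1:
            return p0 + (p1 - p0) * (x - x0) / (x1 - x0)
    return 1.0

points = empirical_cdf(data)
for index, (value, prob) in enumerate(points):
    print(index, value, round(prob, 3))
```

Linear interpolation between the ranked points is one common choice; a step function would be an equally valid reading of the same table.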

For Thursday Everybody Buy
 M&M’s at UC Davis Store
 1.69 oz bag of Milk Chocolate M&M’s

When You Can’t Use Your Data
 Given wide variety of
distributions it is not
always easy to select
the most appropriate
one

Does Distribution Matter?
 Wrong assumption =>
 Incorrect results=>
 Poor decisions=>
 Undesirable outcomes

Understand Your Data
►Experiments
►Observation
►Surveys
►Computer databases
►Literature searches
►Simulations
►Test case
Where did my
data come
from?

►Discrete variables take one
of a set of identifiable values;
you can calculate the probability
of occurrence of each value
►Continuous variables can take any
value within a defined range;
you can’t calculate the
probability of a single value
Is my variable
discrete or
continuous?

Discrete:
Barges in a tow
Houses in floodplain
People at a meeting
Results of a diagnostic test
Casualties per year
Relocations and acquisitions
Continuous:
Average number of barges per tow
Weight of an adult striped bass
Sensitivity or specificity of a diagnostic test
Transit time
Expected annual damages
Duration of a storm
Shoreline eroded
Sediment loads

Choose Distribution That
Matches

What Values Are Possible?
►Bounded-value confined to
lie between two determined
values
►Unbounded-value
theoretically extends from
minus infinity to plus infinity
►Partially bounded-
constrained at one end
(truncated distributions)
Is my variable
bounded or
unbounded?

Continuous Distribution
Examples
 Unbounded
► Normal
► t
► Logistic
 Left Bounded
► Chi-square
► Exponential
► Gamma
► Lognormal
► Weibull
 Bounded
► Beta
► Cumulative
► General/histogram
► Pert
► Uniform
► Triangle

Discrete Distribution Examples
 Unbounded
► None
 Left Bounded
► Poisson
► Negative binomial
► Geometric
 Bounded
► Binomial
► Hypergeometric
► Discrete
► Discrete Uniform

Are There Parameters
►Parametric--shape is
determined by mathematics
of a conceptual probability
model
►Non-parametric—empirical
distributions whose
mathematics is defined by
the shape required
Does my
variable have
meaningful
parameters?

Parametric and Non-Parametric
Parametric
 Normal
 Lognormal
 Exponential
 Poisson
 Binomial
 Gamma
Non-parametric
 Uniform
 Pert
 Triangular
 Cumulative
Do You Know the Parameters?
 1st order distribution-
known parameters
►Risknormal(100,10)
 2nd order distribution-
uncertain parameters
► Risknormal(risktriang(90,100,103),riskuniform(8,11))
Do I know the
parameters?
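The difference between the two cases can be sketched outside @RISK with Python's standard library. This is a rough analogue under stated assumptions: the triangular parameters are read as minimum 90, most likely 100, maximum 103, the seed and sample size are arbitrary, and `sample_first_order` and `sample_second_order` are hypothetical helper names.

```python
import random
import statistics

rng = random.Random(2015)  # arbitrary seed so the sketch is repeatable

def sample_first_order(n):
    """Known parameters: every draw uses mean 100 and SD 10."""
    return [rng.gauss(100.0, 10.0) for _ in range(n)]

def sample_second_order(n):
    """Uncertain parameters: draw the mean and SD first, then the value."""
    draws = []
    for _ in range(n):
        mean = rng.triangular(90.0, 103.0, 100.0)  # arguments are low, high, mode
        sd = rng.uniform(8.0, 11.0)
        draws.append(rng.gauss(mean, sd))
    return draws

first = sample_first_order(20000)
second = sample_second_order(20000)

mean_first, sd_first = statistics.fmean(first), statistics.stdev(first)
mean_second, sd_second = statistics.fmean(second), statistics.stdev(second)
print(f"1st order: mean {mean_first:.1f}, SD {sd_first:.1f}")
print(f"2nd order: mean {mean_second:.1f}, SD {sd_second:.1f}")
```

Note that `random.triangular` takes its arguments in the order low, high, mode, which differs from the minimum, most likely, maximum order used in the @RISK expression.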

Choose a parametric distribution
under these circumstances
 Theory supports choice
 Distribution proven accurate
for modelling your specific
variable
 Distribution matches observed
data well
 Need distribution with tail
extending beyond the
observed minimum or
maximum

Choose a nonparametric
distribution under these
circumstances
 Theory is lacking
 There is no commonly used
model
 Data are severely limited
 Knowledge is limited to general
beliefs and some evidence

Is It Dependent on Other
Variables?
►Univariate--not
probabilistically linked
to any other variable in
the model
►Multivariate--are
probabilistically linked
in some way
►Engineering
relationships are often
multivariate
Do the values of
my variable depend
on the values of
other variables?

Distribution | Type | No Bounds | Bounded Left & Right | Left Bound Only | Category | Shape | Empirical Distribution
Beta | Continuous | No | Yes | No | Non-parametric | Shape shifter | No
Binomial | Discrete | No | Yes | No | Parametric | Some Flexibility | No
Chi Squared | Continuous | No | No | Yes | Parametric | Basic Shape | No
Cumulative ascending | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Cumulative descending | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Discrete | Discrete | No | Yes | No | Non-parametric | Shape shifter | Yes
Discrete uniform | Discrete | No | Yes | No | Non-parametric | Basic Shape | No
Erlang | Continuous | No | No | Yes | Parametric | Basic Shape | No
Error | Continuous | Yes | No | No | Parametric | Some Flexibility | No
Exponential | Continuous | No | No | Yes | Parametric | Basic Shape | No
Extreme value | Continuous | No | No | Yes | Parametric | Basic Shape | No
Gamma | Continuous | No | No | Yes | Parametric | Shape shifter | No
General | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Geometric | Discrete | No | No | Yes | Parametric | Some Flexibility | No
Histogram | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Hypergeometric | Discrete | No | Yes | No | Parametric | Some Flexibility | No
Integer uniform | Discrete | No | Yes | No | Non-parametric | Basic Shape | No
Inverse Gaussian | Continuous | No | No | Yes | Parametric | Basic Shape | No
Logarithmic | Discrete | No | No | Yes | Parametric | Some Flexibility | No
Logistic | Continuous | Yes | No | No | Parametric | Basic Shape | No
Lognormal | Continuous | No | No | Yes | Parametric | Basic Shape | No
Lognormal2 | Continuous | No | No | Yes | Parametric | Basic Shape | No
Negative Binomial | Discrete | No | No | Yes | Parametric | Some Flexibility | No
Normal | Continuous | Yes | No | No | Parametric | Basic Shape | No
Pareto | Continuous | No | No | Yes | Parametric | Basic Shape | No
Pareto2 | Continuous | No | No | Yes | Parametric | Basic Shape | No
Pearson V | Continuous | No | No | Yes | Parametric | Some Flexibility | No
Pearson VI | Continuous | No | No | Yes | Parametric | Some Flexibility | No
PERT | Continuous | No | Yes | No | Non-parametric | Shape shifter | No
Poisson | Discrete | No | No | Yes | Parametric | Basic Shape | No
Rayleigh | Continuous | No | No | Yes | Parametric | Basic Shape | No
Student | Continuous | Yes | No | No | Parametric | Basic Shape | No
Triangle (various) | Continuous | No | Yes | No | Non-parametric | Some Flexibility | No
Weibull | Continuous | No | No | Yes | Parametric | Shape shifter | No
How can I make
this table
legible?

Always plot
your data.
 Don’t just calculate
Mean & SD and
assume it’s normal

Look for
distinctive
shapes and
features of your
data.
►Single peaks
►Symmetry
►Positive skew
►Negative values
Gamma,
Weibull, and
beta are useful
and flexible
functions.

►Low coefficient of variation
& mean = median => Normal
►Positive skew & mean =
standard deviation =>
Exponential
►Consider outliers
Try calculating some
statistics to get a feel.
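These rules of thumb can be turned into a quick screening check. The sketch below is only a heuristic aid: the cut-offs (`cv_limit=0.1`, `tol=0.1`) are assumptions picked for illustration, the two samples are made up, and the function names `describe` and `suggest` are hypothetical.

```python
import math
import statistics

def describe(values):
    """Summary statistics used by the rules of thumb."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    pop_sd = statistics.pstdev(values)
    skew = statistics.fmean([(v - mean) ** 3 for v in values]) / pop_sd ** 3
    return {"mean": mean, "median": statistics.median(values),
            "sd": sd, "cv": sd / mean, "skew": skew}

def suggest(stats, cv_limit=0.1, tol=0.1):
    """Return a candidate distribution, or 'no suggestion'. Thresholds are assumed."""
    if stats["cv"] < cv_limit and abs(stats["mean"] - stats["median"]) <= tol * stats["sd"]:
        return "normal"
    if stats["skew"] > 0 and abs(stats["mean"] - stats["sd"]) <= tol * stats["mean"]:
        return "exponential"
    return "no suggestion"

# Made-up samples: a tight symmetric one, and evenly spaced exponential quantiles.
symmetric = [98.0, 99.0, 100.0, 101.0, 102.0]
n = 200
skewed = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]

print(suggest(describe(symmetric)))
print(suggest(describe(skewed)))
```

A suggestion from a check like this is a starting point for plotting and theory, not a substitute for them.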

 Formal theory, e.g., CLT
 Theoretical knowledge of
the variable
►Behavior or math
 Informal theory
►Sums normal,
products lognormal
►Study specific
►Your best documented
thoughts on subject
Theory is your most
compelling reason for
choosing a distribution.
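The informal rule "sums normal, products lognormal" can be checked with a small simulation. This is a sketch under assumptions: the 30 factors drawn from a uniform(0.5, 1.5) distribution, the seed and the 5,000 trials are arbitrary choices, and skewness near zero is used as a rough proxy for symmetry.

```python
import math
import random
import statistics

rng = random.Random(42)  # arbitrary seed so the sketch is repeatable

def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - mean) ** 3 for v in values]) / sd ** 3

def factors(k=30):
    """k independent positive quantities (an assumed, illustrative choice)."""
    return [rng.uniform(0.5, 1.5) for _ in range(k)]

trials = 5000
sums = [sum(factors()) for _ in range(trials)]
products = [math.prod(factors()) for _ in range(trials)]
log_products = [math.log(p) for p in products]

print(f"skew of sums:         {skewness(sums):.2f}")
print(f"skew of products:     {skewness(products):.2f}")
print(f"skew of log products: {skewness(log_products):.2f}")
```

The log of a product is a sum of logs, so the central limit theorem applies to the log of the product; that is why the product itself tends toward a lognormal shape with a long right tail.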

Outliers
[Figure: House Value, axis from 0 to 600]
Extreme observations can
drastically influence a
probability model
What are
these points
telling you?
► What about your
world-view is
inconsistent with
this result?
► Should you
reconsider your
perspective?
► What possible
explanations
have you not yet
considered?

If observation is an
error, remove it.
Your explanation must
be correct, not just
plausible.
If you must keep it and
can’t explain it, just live
with the results.

Previous Experience
Have I dealt
with this before?
What did other risk
assessments use?
What does the
Literature reveal?

Goodness of Fit
 Provides statistical evidence to test
hypothesis that your data could have
come from a specific distribution
 H0: these data come from an “x”
distribution
 Small test statistic and large p-value mean
you cannot reject H0
 It is another piece of evidence not a
determining factor
Geek slide
warning

Hypothesis
The data (blue) come from
the population (red)

GOF Tests
 Akaike Information
Criterion (AIC)
 Bayesian Information
Criterion (BIC)
 Chi-Square Test: better fit for means than tails
 Kolmogorov-Smirnov Test: better fit for means than tails
 Anderson-Darling Test: better fit at extreme tails of distribution
Geek slide
warning
Use AIC or BIC
unless you have
reason not to
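As a rough illustration of an AIC comparison, the sketch below fits two candidate distributions by maximum likelihood and scores them with AIC = 2k - 2 ln L, where k is the number of fitted parameters and a lower score is preferred. The assumptions are loud ones: the 19 values are reused from the empirical distribution table earlier in the deck purely as sample data, only normal and exponential candidates are tried, and this is hand-rolled code, not the fitting routine of @RISK or any other package.

```python
import math

data = [0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7, 23.2,
        24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def normal_loglik(values):
    """Log-likelihood at the maximum-likelihood mean and variance."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return sum(-0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
               for v in values)

def exponential_loglik(values):
    """Log-likelihood at the maximum-likelihood rate (1 / sample mean)."""
    rate = len(values) / sum(values)
    return sum(math.log(rate) - rate * v for v in values)

scores = {
    "normal": aic(normal_loglik(data), n_params=2),
    "exponential": aic(exponential_loglik(data), n_params=1),
}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AIC = {score:.2f}")
print("lowest AIC:", best)
```

When two scores come out close, the comparison says little on its own, which is consistent with treating goodness of fit as one piece of evidence among several.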

►Data never collected
►Data too expensive or
impossible
►Past data irrelevant
►Opinion needed to fill
holes in sparse data
►New area of inquiry,
unique situation that
never existed
Why use
an expert?

 The distribution itself
►E.g. population is
normal
 Parameters of the
distribution
►E.g. mean is x and
standard deviation is y
►E.g. 5th, 50th, 95th
percentile values
What might
an expert
estimate?

 Unsure which
distribution is best?
 Try several
►If no difference you
are free to use any
one
►Significant
differences mean
doing more work
A final strategy is to test
the sensitivity of your
results to the uncertain
probability model.
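A sensitivity check of this kind can be sketched by running the same calculation under several candidate distributions and comparing a result of interest. Everything in the sketch is a made-up illustration: the three candidates, their parameters, the seed and the choice of the 95th percentile as the decision-relevant result are all assumptions.

```python
import random

rng = random.Random(7)  # arbitrary seed so the sketch is repeatable
N = 20000

# Three candidate models for the same uncertain quantity (assumed parameters).
candidates = {
    "normal": lambda: rng.gauss(100.0, 10.0),
    "triangular": lambda: rng.triangular(70.0, 130.0, 100.0),  # low, high, mode
    "uniform": lambda: rng.uniform(70.0, 130.0),
}

def percentile(values, p):
    """Simple empirical percentile from the ranked sample."""
    ranked = sorted(values)
    return ranked[min(int(p * len(ranked)), len(ranked) - 1)]

results = {name: percentile([draw() for _ in range(N)], 0.95)
           for name, draw in candidates.items()}
spread = max(results.values()) - min(results.values())

for name, value in results.items():
    print(f"{name}: 95th percentile = {value:.1f}")
print(f"spread across candidates = {spread:.1f}")
```

Whether a given spread counts as a significant difference depends on the decision at hand; if it does, that is the signal to do more work on the choice of distribution.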

 Do exercise

Example
28.1 28.3 29.0 28.8 29.1 28.8 28.9 28.7 28.5 28.8 28.3
28.6 29.1 28.9 28.7 29.3 28.9 29.4 29.4 28.8 29.5 28.5
29.0 28.6 28.7 29.1 29.1 28.6 28.6 28.8 29.2 29.2 28.4
29.5 28.9 28.9 28.9 28.7 28.3 28.7 28.9 28.4 28.7 29.2
29.2 29.7 29.1 29.2 28.5 28.9 28.6 28.7 28.5 28.5 29.1
28.8 28.2 28.6 28.9 29.5 28.9 29.0 29.1 28.8 29.2 28.7
29.3 28.7 28.8 29.1 29.4 29.0 29.2 28.8 28.5 29.0 29.3
29.1 29.0 28.6 28.6 29.5 29.5 28.8 29.6 29.0 29.5 28.7
28.9 28.2 29.2 29.0 28.7 28.9 28.6 28.5 29.6 29.6 28.3
28.7 29.0 29.0 29.3 28.5 28.9 28.4 28.7 28.9 28.9 29.0
Summer Water Temperature in Degrees Celsius

Know Your Data
 Data are continuous.
 Practical minimum and maximum temperatures for these
coastal waters during the summer are indefinite. Fuzzy
bounds mean you can treat the quantity as unbounded over a
limited range of the number line.
 Choose a normal (parametric) distribution because some
theory supports choice, distribution matches the observed data
well, and distribution has a tail extending beyond the observed
minimum and maximum. That information will be revealed in
subsequent steps.
 Parameter estimates mean a first-order distribution.
 Univariate.

Look at It
[Figure: Daily Maximum Water Temperature. x-axis: Water Temperature (Degrees Celsius); y-axis: Percent, 0 to 18]

Look Again!
[Figure: Water Temperature Boxplot, axis from 27.8 to 29.6]
The boxplot confirms a slight skew to the
left in the sample data and in the
interquartile range.

Theory
 The “system” that produces a daily
maximum temperature is complex
enough and random enough that we
believe deviations about the mean are
likely to be symmetrical.

Statistics
Descriptive statistics
Count 100
Mean 28.8
Median 28.8
Sample standard deviation 0.3
Minimum 28.1
Maximum 29.5
Range 1.4
Standard error of the mean 0.027
Confidence interval 95% lower 28.7
Confidence interval 95% upper 28.9
Coefficient of variation (%) 0.9
 Mean and median are
approximately equal
 Coefficient of variation (0
to 100+ scale) is small.
 There are no outliers in
this dataset.
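The derived rows in the table follow from the count, mean and standard error. The sketch below works backward from those three reported values; it assumes a normal critical value for the 95% interval (the software that produced the table may have used a t value instead) and assumes the coefficient of variation is expressed in percent.

```python
import math
from statistics import NormalDist

# Values reported in the descriptive statistics table
n = 100
mean = 28.8
standard_error = 0.027

# SE = SD / sqrt(n), so the unrounded SD implied by the table is SE * sqrt(n)
sd = standard_error * math.sqrt(n)

# Coefficient of variation on a percent scale: 100 * SD / mean
cv_percent = 100.0 * sd / mean

# 95% confidence interval for the mean, using a normal critical value (assumption)
z = NormalDist().inv_cdf(0.975)
ci_lower = mean - z * standard_error
ci_upper = mean + z * standard_error

print(f"SD = {sd:.2f}, CV = {cv_percent:.1f}%")
print(f"95% CI = ({ci_lower:.1f}, {ci_upper:.1f})")
```

Rounded to one decimal place, these reproduce the table's standard deviation of 0.3, coefficient of variation of 0.9 and interval of 28.7 to 28.9.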

GOF
Chi-Square Test Results
Statistic | Normal | Weibull | Logistic | ExtValue | Pareto
Chi-Sq Statistic | 7.14 | 8.02 | 11.98 | 17.26 | 121.89
P-Value | 0.7122 | 0.6269 | 0.2864 | 0.0688 | 0
Cr. Value @ 0.750 | 6.7372 | 6.7372 | 6.7372 | 6.7372 | 6.7372
Cr. Value @ 0.500 | 9.3418 | 9.3418 | 9.3418 | 9.3418 | 9.3418
Cr. Value @ 0.250 | 12.5489 | 12.5489 | 12.5489 | 12.5489 | 12.5489
Cr. Value @ 0.150 | 14.5339 | 14.5339 | 14.5339 | 14.5339 | 14.5339
Cr. Value @ 0.100 | 15.9872 | 15.9872 | 15.9872 | 15.9872 | 15.9872
Cr. Value @ 0.050 | 18.307 | 18.307 | 18.307 | 18.307 | 18.307
Cr. Value @ 0.025 | 20.4832 | 20.4832 | 20.4832 | 20.4832 | 20.4832
Cr. Value @ 0.010 | 23.2093 | 23.2093 | 23.2093 | 23.2093 | 23.2093
Cr. Value @ 0.005 | 25.1882 | 25.1882 | 25.1882 | 25.1882 | 25.1882
Cr. Value @ 0.001 | 29.5883 | 29.5883 | 29.5883 | 29.5883 | 29.5883
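The p-values in these results can be reproduced from the test statistics. The sketch below assumes 10 degrees of freedom, which is not stated on the slide but is inferred from the critical values (18.307 at the 0.05 level corresponds to 10 degrees of freedom). It uses the closed-form upper-tail probability that holds for even degrees of freedom, and the function name is hypothetical.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); valid for even df only.

    For df = 2m: exp(-x/2) * sum_{k=0}^{m-1} (x/2)**k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

DF = 10  # assumption inferred from the critical values in the table

statistics_by_fit = {"Normal": 7.14, "Weibull": 8.02, "Logistic": 11.98,
                     "ExtValue": 17.26, "Pareto": 121.89}
p_values = {name: chi2_upper_tail_even_df(stat, DF)
            for name, stat in statistics_by_fit.items()}

for name, p in p_values.items():
    print(f"{name}: chi-sq = {statistics_by_fit[name]}, p = {p:.4f}")
```

A large p-value, as for the normal fit here, means the data give no grounds to reject that distribution; it does not prove the data came from it.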

Choice
 An expert opinion is not necessary.
 Sensitivity analysis will not be necessary.
 The normal distribution is continuous,
parametric, consistent with my theory,
successfully used in the past, and
statistically consistent with my data.
Therefore I will use it.

Take Away Points
 Choosing the best distribution is where
most new risk assessors feel least
comfortable.
 Choice of distribution matters.
 Distributions come from data and
expert opinion.
 Distribution fitting should never be the
basis for distribution choice.
