US Army Corps of Engineers
BUILDING STRONG®
Choosing a Probability
Distribution
Charles Yoe
Professor of Economics, Notre Dame of
Maryland University
Huntington District
March, 2015
BUILDING STRONG®
Learning Objectives
At the end of this session participants will be able to:
• Systematically identify the best distribution to use for a single uncertain variable
BUILDING STRONG®
Quantitative risk assessment
requires you to use probability
BUILDING STRONG®
Using Probability
• Sometimes you will estimate the probability of an event
• Sometimes you will use distributions to
►Describe data
►Model variability (range of consequences)
►Represent your uncertainty
Which distribution
should I use?
BUILDING STRONG®
Basic Distinctions
• Constants
• Variables
  • Some things vary predictably
  • Some things vary unpredictably
• Random variables
  • A random variable can be something that is known, just not known by us
BUILDING STRONG®
A probability distribution is
simply a list or picture of all
possible outcomes and their
corresponding probabilities
BUILDING STRONG®
When the outcomes are
continuous, like lockage
times, then the notion of
probability takes on some
subtleties.
BUILDING STRONG®
Probability Mass Function
When the sample space consists of discrete outcomes, such as the number of tainter gate failures this year, we can talk about the probability of each outcome.
BUILDING STRONG®
Probability Density Function
For continuous outcome
spaces, we can “discretize”
the space into a finite set
of mutually exclusive and
exhaustive “bins”. We can
divide lockage times into
intervals: 0-50, >50-100,
>100-103.7 and so on.
BUILDING STRONG®
[Figure: three panels showing a probability density function, a cumulative distribution function, and a survival function]
BUILDING STRONG®
[Figure annotations, discrete variable] Horizontal axis: the value of the variable x. Vertical axis: p(x), which for a discrete variable is the probability of the corresponding x.
BUILDING STRONG®
[Figure annotations, continuous variable] Horizontal axis: the value of the variable x. Vertical axis: p(x), which for a continuous variable is a density and not the probability of x.
BUILDING STRONG®
Notice the vertical axes in two normal distributions with a mean of 100.
When the SD is small (right), many more values must get ‘packed’ into a
smaller interval, making the data much denser: 4 vs. 0.4.
Thus, the vertical axis for a continuous distribution has
no meaning as a probability. It measures the density of
the data.
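The point about density can be checked numerically. The following is a minimal sketch, assuming standard deviations of 1 and 0.1; the slide states only the mean of 100 and the peak heights, and these two values are chosen here because they give peaks of roughly 0.4 and 4.

```python
from statistics import NormalDist

# Assumed standard deviations: the slide gives only the mean (100)
# and the approximate peak heights (0.4 and 4).
wide = NormalDist(mu=100, sigma=1.0)
narrow = NormalDist(mu=100, sigma=0.1)

# Height of each curve at the mean. This is a density, not a probability,
# which is why the narrow curve can exceed 1.
peak_wide = wide.pdf(100)
peak_narrow = narrow.pdf(100)


def interval_probability(dist, low, high):
    """Probability that the variable falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Probabilities come from areas under the curve, and they stay between 0 and 1.
within_one_sd_wide = interval_probability(wide, 99.0, 101.0)
within_one_sd_narrow = interval_probability(narrow, 99.9, 100.1)

print(f"peak density: {peak_wide:.3f} vs {peak_narrow:.3f}")
print(f"P(within 1 SD): {within_one_sd_wide:.3f} vs {within_one_sd_narrow:.3f}")
```

Both curves put the same probability within one standard deviation of the mean even though their peak heights differ by a factor of ten.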
BUILDING STRONG®
Applying the Principles
• How many probability distributions can we name without help?
BUILDING STRONG®
Checklist for Choosing a Distribution From Some Data
1. Can you use your data?
2. Understand your variable
a) Source of data
b) Continuous/discrete
c) Bounded/unbounded
d) Meaningful parameters: do you know them? (1st or 2nd order)
e) Univariate/multivariate
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis
BUILDING STRONG®
First!
• Do you have data germane to your population?
• If so, do you need a distribution or can you just use your data?
BUILDING STRONG®
One problem could be
bounding your data, if
you do not have the true
minimum and maximum
values.
Any dataset can be displayed as
a cumulative distribution function
or a general density function
BUILDING STRONG®
Display Either Way
BUILDING STRONG®
Applying the Principles
• Go to @RISK
• What are the advantages of a PDF?
• Of a CDF?
BUILDING STRONG®
Fitting Empirical Distribution to Data
• Rank data x(i) in ascending order
• Calculate the percentile for each value
• Use data and percentiles to create cumulative distribution function
Columns: index i, data value, cumulative probability F(x) = i/19
0 0 0
1 0.9 0.053
2 3.6 0.105
3 5.0 0.158
4 6.0 0.211
5 11.7 0.263
6 16.2 0.316
7 16.5 0.368
8 22.2 0.421
9 22.7 0.474
10 23.2 0.526
11 24.5 0.579
12 24.9 0.632
13 25.8 0.684
14 33.3 0.737
15 33.4 0.789
16 34.7 0.842
17 40.2 0.895
18 44.2 0.947
19 60.0 1
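The three steps above can be sketched in a few lines. This is an illustrative sketch, not the @RISK procedure itself: the function names `empirical_cdf` and `cdf_at` are hypothetical, and linear interpolation between points is an assumption. The lower bound of 0 is taken from the index 0 row of the table.

```python
def empirical_cdf(values, lower_bound):
    """Rank the data and pair each value with its percentile i/n."""
    ordered = sorted(values)
    n = len(ordered)
    points = [(lower_bound, 0.0)]  # index 0 row: the assumed minimum
    points += [(x, i / n) for i, x in enumerate(ordered, start=1)]
    return points


def cdf_at(points, x):
    """Cumulative probability at x, interpolating linearly (an assumption)."""
    if x <= points[0][0]:
        return 0.0
    if x >= points[-1][0]:
        return 1.0
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            if x1 == x0:  # tied observations
                return p1
            return p0 + (p1 - p0) * (x - x0) / (x1 - x0)


# The 19 observations from the table above.
observations = [0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7, 23.2,
                24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

points = empirical_cdf(observations, lower_bound=0.0)
for index, (value, probability) in enumerate(points):
    print(f"{index:2d} {value:5.1f} {probability:.3f}")
```

Note that the largest observation is assigned a cumulative probability of 1, so this distribution can never produce a value beyond the observed maximum.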
BUILDING STRONG®
For Thursday Everybody Buy
• M&M’s at UC Davis Store
• 1.69 oz bag of Milk Chocolate M&M’s
BUILDING STRONG®
When You Can’t Use Your Data
• Given the wide variety of distributions, it is not always easy to select the most appropriate one
BUILDING STRONG®
Does Distribution Matter?
• Wrong assumption =>
• Incorrect results =>
• Poor decisions =>
• Undesirable outcomes
BUILDING STRONG®
Understand Your Data
►Experiments
►Observation
►Surveys
►Computer databases
►Literature searches
►Simulations
►Test case
Where did my
data come
from?
BUILDING STRONG®
►Discrete variables take one of a set of identifiable values; you can calculate the probability of each value
►Continuous variables can take any value within a defined range; the probability of any single exact value is zero, so you calculate probabilities over intervals
Is my variable
discrete or
continuous?
BUILDING STRONG®
Barges in a tow
Houses in floodplain
People at a meeting
Results of a diagnostic test
Casualties per year
Relocations and acquisitions
Average number of barges per tow
Weight of an adult striped bass
Sensitivity or specificity of a diagnostic test
Transit time
Expected annual damages
Duration of a storm
Shoreline eroded
Sediment loads
BUILDING STRONG®
Choose Distribution That
Matches
BUILDING STRONG®
What Values Are Possible?
►Bounded: value confined to lie between two determined values
►Unbounded: value theoretically extends from minus infinity to plus infinity
►Partially bounded: constrained at one end (truncated distributions)
Is my variable
bounded or
unbounded?
BUILDING STRONG®
Continuous Distribution
Examples
• Unbounded
► Normal
► t
► Logistic
• Left Bounded
► Chi-square
► Exponential
► Gamma
► Lognormal
► Weibull
• Bounded
► Beta
► Cumulative
► General/histogram
► Pert
► Uniform
► Triangle
BUILDING STRONG®
Discrete Distribution Examples
• Unbounded
► None
• Left Bounded
► Poisson
► Negative binomial
► Geometric
• Bounded
► Binomial
► Hypergeometric
► Discrete
► Discrete Uniform
BUILDING STRONG®
Are There Parameters
►Parametric: shape is determined by the mathematics of a conceptual probability model
►Non-parametric: empirical distributions whose mathematics is defined by the shape required
Does my
variable have
meaningful
parameters?
BUILDING STRONG®
Parametric and Non-Parametric
• Normal
• Lognormal
• Exponential
• Poisson
• Binomial
• Gamma
• Uniform
• Pert
• Triangular
• Cumulative
BUILDING STRONG®
Do You Know the Parameters?
• 1st order distribution: known parameters
►Risknormal(100,10)
• 2nd order distribution: uncertain parameters
►Risknormal(risktriang(90,100,103),riskuniform(8,11))
Do I know the
parameters?
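To make the second-order idea concrete, here is a rough Python analogue of the nested @RISK formula above. It is a sketch under assumptions: the function name `sample_second_order_normal` is hypothetical, and it assumes the parameters are redrawn on every iteration. Python's `random.triangular` takes its arguments as (low, high, mode), so the order differs from the (minimum, most likely, maximum) order written on the slide.

```python
import random


def sample_second_order_normal(n, seed=None):
    """Draw n values from a normal whose mean and SD are themselves uncertain."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Uncertain mean: triangular with minimum 90, most likely 100, maximum 103.
        mean = rng.triangular(90, 103, 100)
        # Uncertain standard deviation: uniform between 8 and 11.
        sd = rng.uniform(8, 11)
        # First-order draw, conditional on the sampled parameters.
        draws.append(rng.gauss(mean, sd))
    return draws


draws = sample_second_order_normal(10_000, seed=1)
print(f"mean of draws: {sum(draws) / len(draws):.1f}")
```

A first-order version would simply call `rng.gauss(100, 10)` each time, with no uncertainty about the parameters.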
BUILDING STRONG®
Choose a parametric distribution
under these circumstances
• Theory supports choice
• Distribution proven accurate for modelling your specific variable
• Distribution matches observed data well
• Need distribution with tail extending beyond the observed minimum or maximum
BUILDING STRONG®
Choose a nonparametric
distribution under these
circumstances
• Theory is lacking
• There is no commonly used model
• Data are severely limited
• Knowledge is limited to general beliefs and some evidence
BUILDING STRONG®
Is It Dependent on Other
Variables?
►Univariate: not probabilistically linked to any other variable in the model
►Multivariate: probabilistically linked to other variables in some way
►Engineering relationships are often multivariate
Do the values of
my variable depend
on the values of
other variables?
BUILDING STRONG®
Distribution | Type | No Bounds | Bounded Left & Right | Left Bound Only | Category | Shape | Empirical Distribution
Beta | Continuous | No | Yes | No | Non-parametric | Shape shifter | No
Binomial | Discrete | No | Yes | No | Parametric | Some Flexibility | No
Chi Squared | Continuous | No | No | Yes | Parametric | Basic Shape | No
Cumulative ascending | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Cumulative descending | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Discrete | Discrete | No | Yes | No | Non-parametric | Shape shifter | Yes
Discrete uniform | Discrete | No | Yes | No | Non-parametric | Basic Shape | No
Erlang | Continuous | No | No | Yes | Parametric | Basic Shape | No
Error | Continuous | Yes | No | No | Parametric | Some Flexibility | No
Exponential | Continuous | No | No | Yes | Parametric | Basic Shape | No
Extreme value | Continuous | No | No | Yes | Parametric | Basic Shape | No
Gamma | Continuous | No | No | Yes | Parametric | Shape shifter | No
General | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Geometric | Discrete | No | No | Yes | Parametric | Some Flexibility | No
Histogram | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
Hypergeometric | Discrete | No | Yes | No | Parametric | Some Flexibility | No
Integer uniform | Discrete | No | Yes | No | Non-parametric | Basic Shape | No
Inverse Gaussian | Continuous | No | No | Yes | Parametric | Basic Shape | No
Logarithmic | Discrete | No | No | Yes | Parametric | Some Flexibility | No
Logistic | Continuous | Yes | No | No | Parametric | Basic Shape | No
Lognormal | Continuous | No | No | Yes | Parametric | Basic Shape | No
Lognormal2 | Continuous | No | No | Yes | Parametric | Basic Shape | No
Negative Binomial | Discrete | No | No | Yes | Parametric | Some Flexibility | No
Normal | Continuous | Yes | No | No | Parametric | Basic Shape | No
Pareto | Continuous | No | No | Yes | Parametric | Basic Shape | No
Pareto2 | Continuous | No | No | Yes | Parametric | Basic Shape | No
Pearson V | Continuous | No | No | Yes | Parametric | Some Flexibility | No
Pearson VI | Continuous | No | No | Yes | Parametric | Some Flexibility | No
PERT | Continuous | No | Yes | No | Non-parametric | Shape shifter | No
Poisson | Discrete | No | No | Yes | Parametric | Basic Shape | No
Rayleigh | Continuous | No | No | Yes | Parametric | Basic Shape | No
Student | Continuous | Yes | No | No | Parametric | Basic Shape | No
Triangle (various) | Continuous | No | Yes | No | Non-parametric | Some Flexibility | No
Weibull | Continuous | No | No | Yes | Parametric | Shape shifter | No
How can I make
this table
legible?
BUILDING STRONG®
Always plot
your data.
• Don’t just calculate the mean & SD and assume it’s normal
BUILDING STRONG®
Look for
distinctive
shapes and
features of your
data.
►Single peaks
►Symmetry
►Positive skew
►Negative values
Gamma,
Weibull, and
beta are useful
and flexible
functions.
BUILDING STRONG®
►Low coefficient of variation
& mean = median => Normal
►Positive skew & mean =
standard deviation =>
Exponential
►Consider outliers
Try calculating some
statistics to get a feel.
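The rules of thumb on this slide can be turned into a quick screening helper. This is only a sketch: the function names, the coefficient-of-variation threshold, and the tolerance are hypothetical choices made for illustration, and the sample is made up. The hints it returns are suggestions to investigate, not conclusions.

```python
import statistics


def describe(values):
    """Mean, median, SD, coefficient of variation, and sample skewness."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    skew = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / sd) ** 3 for x in values)
    return {"mean": mean, "median": statistics.median(values),
            "sd": sd, "cv": sd / mean, "skew": skew}


def suggest(stats, cv_threshold=0.1, tolerance=0.1):
    """Apply the slide's two heuristics; both thresholds are assumptions."""
    hints = []
    mean_near_median = abs(stats["mean"] - stats["median"]) <= tolerance * stats["sd"]
    if stats["cv"] < cv_threshold and mean_near_median:
        hints.append("normal")
    mean_near_sd = abs(stats["mean"] - stats["sd"]) <= tolerance * stats["mean"]
    if stats["skew"] > 0 and mean_near_sd:
        hints.append("exponential")
    return hints


# Hypothetical sample, symmetric about 100 with little spread.
sample = [98.0, 99.0, 100.0, 101.0, 102.0]
summary = describe(sample)
print(summary)
print(suggest(summary))
```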
BUILDING STRONG®
• Formal theory, e.g., the central limit theorem (CLT)
• Theoretical knowledge of the variable
►Behavior or math
• Informal theory
►Sums normal, products lognormal
►Study specific
►Your best documented thoughts on subject
Theory is your most
compelling reason for
choosing a distribution.
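The "sums normal, products lognormal" rule of thumb can be illustrated with a small simulation. This is a sketch under assumed inputs: twelve independent factors, each uniform between 0.5 and 1.5, are arbitrary choices made here and do not come from the slides.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: 0 for a symmetric sample, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in values])


rng = random.Random(42)
TRIALS, FACTORS = 5000, 12  # assumed simulation size

sums, products = [], []
for _ in range(TRIALS):
    factors = [rng.uniform(0.5, 1.5) for _ in range(FACTORS)]
    sums.append(sum(factors))            # additive process
    products.append(math.prod(factors))  # multiplicative process

# Taking logs turns a product into a sum, so the log of the product
# should look roughly symmetric again.
log_products = [math.log(p) for p in products]

print(f"skewness of sums:          {skewness(sums):.2f}")
print(f"skewness of products:      {skewness(products):.2f}")
print(f"skewness of log(products): {skewness(log_products):.2f}")
```

The sums come out close to symmetric, the products are strongly right-skewed, and the logs of the products are close to symmetric, which is the behavior a lognormal model describes.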
BUILDING STRONG®
Outliers
[Figure: House Value, horizontal axis from 0 to 600]
Extreme observations can
drastically influence a
probability model
What are
these points
telling you?
► What about your
world-view is
inconsistent with
this result?
► Should you
reconsider your
perspective?
► What possible
explanations
have you not yet
considered?
BUILDING STRONG®
If an observation is an
error, remove it.
Your explanation must
be correct, not just
plausible.
If you must keep it and
can’t explain it, just live
with the results.
BUILDING STRONG®
Previous Experience
Have I dealt
with this before?
What did other risk
assessments use?
What does the
Literature reveal?
BUILDING STRONG®
Goodness of Fit
• Provides statistical evidence to test the hypothesis that your data could have come from a specific distribution
• H0: these data come from an “x” distribution
• A small test statistic and a large p-value mean you fail to reject H0
• It is another piece of evidence, not a determining factor
Geek slide
warning
BUILDING STRONG®
Hypothesis
The data (blue) come from
the population (red)
BUILDING STRONG®
GOF Tests
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• Chi-Square Test
• Kolmogorov-Smirnov Test
• Anderson-Darling Test
Geek slide
warning
Better fit for means
than tails
Better fit at extreme
tails of distribution
Better fit for means
than tails
Use AIC or BIC
unless you have
reason not to
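As an illustration of comparing candidates with AIC and BIC, the sketch below fits a normal and an exponential distribution by maximum likelihood to eleven values (the first row of the water temperature data in the example later in this deck). The choice of candidates and the helper names are assumptions for illustration; fitting software such as @RISK does this for many distributions at once.

```python
import math
from statistics import NormalDist, fmean, pstdev


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def normal_log_likelihood(values):
    # Maximum likelihood estimates: sample mean and population SD.
    dist = NormalDist(fmean(values), pstdev(values))
    return sum(math.log(dist.pdf(x)) for x in values)


def exponential_log_likelihood(values):
    # Maximum likelihood estimate of the rate is 1 / mean.
    rate = 1 / fmean(values)
    return sum(math.log(rate) - rate * x for x in values)


sample = [28.1, 28.3, 29.0, 28.8, 29.1, 28.8, 28.9, 28.7, 28.5, 28.8, 28.3]

aic_normal = aic(normal_log_likelihood(sample), n_params=2)
aic_exponential = aic(exponential_log_likelihood(sample), n_params=1)
bic_normal = bic(normal_log_likelihood(sample), 2, len(sample))
bic_exponential = bic(exponential_log_likelihood(sample), 1, len(sample))

print(f"AIC normal {aic_normal:.1f}, exponential {aic_exponential:.1f}")
print(f"BIC normal {bic_normal:.1f}, exponential {bic_exponential:.1f}")
```

The normal has a far lower score on both criteria, even after the penalty for its extra parameter, so the exponential can be set aside for these data.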
BUILDING STRONG®
►Data never collected
►Data too expensive or
impossible
►Past data irrelevant
►Opinion needed to fill
holes in sparse data
►New area of inquiry,
unique situation that
never existed
Why use
an expert?
BUILDING STRONG®
• The distribution itself
►E.g. population is normal
• Parameters of the distribution
►E.g. mean is x and standard deviation is y
►E.g. 5th, 50th, 95th percentile values
What might
an expert
estimate?
BUILDING STRONG®
• Unsure which distribution is best?
• Try several
►If no difference you are free to use any one
►Significant differences mean doing more work
A final strategy is to test
the sensitivity of your
results to the uncertain
probability model.
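A sensitivity test of this kind can be as simple as rerunning the same calculation under each candidate distribution and comparing the number the decision depends on. The sketch below is illustrative only: the three candidates, their parameters, and the choice of the 95th percentile as the result of interest are all assumptions made here.

```python
import random


def percentile(values, q):
    """Simple empirical percentile: the value at rank q in the sorted sample."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def compare_candidates(n=20_000, seed=7):
    """95th percentile of the same quantity under three assumed models."""
    rng = random.Random(seed)
    candidates = {
        "normal(100, 10)": lambda: rng.gauss(100, 10),
        # random.triangular takes (low, high, mode).
        "triangular(70, 100, 130)": lambda: rng.triangular(70, 130, 100),
        "uniform(70, 130)": lambda: rng.uniform(70, 130),
    }
    return {name: percentile([draw() for _ in range(n)], 0.95)
            for name, draw in candidates.items()}


results = compare_candidates()
for name, p95 in results.items():
    print(f"{name:26s} 95th percentile = {p95:.1f}")
```

If the results are close enough that the decision would not change, any of the candidates will do; if they differ materially, the choice of distribution deserves more work.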
BUILDING STRONG®
Applying the Principles
• Do exercise
BUILDING STRONG®
Example
28.1 28.3 29.0 28.8 29.1 28.8 28.9 28.7 28.5 28.8 28.3
28.6 29.1 28.9 28.7 29.3 28.9 29.4 29.4 28.8 29.5 28.5
29.0 28.6 28.7 29.1 29.1 28.6 28.6 28.8 29.2 29.2 28.4
29.5 28.9 28.9 28.9 28.7 28.3 28.7 28.9 28.4 28.7 29.2
29.2 29.7 29.1 29.2 28.5 28.9 28.6 28.7 28.5 28.5 29.1
28.8 28.2 28.6 28.9 29.5 28.9 29.0 29.1 28.8 29.2 28.7
29.3 28.7 28.8 29.1 29.4 29.0 29.2 28.8 28.5 29.0 29.3
29.1 29.0 28.6 28.6 29.5 29.5 28.8 29.6 29.0 29.5 28.7
28.9 28.2 29.2 29.0 28.7 28.9 28.6 28.5 29.6 29.6 28.3
28.7 29.0 29.0 29.3 28.5 28.9 28.4 28.7 28.9 28.9 29.0
Summer Water Temperature in Degrees Celsius
BUILDING STRONG®
Know Your Data
• Data are continuous.
• Practical minimum and maximum temperatures for these coastal waters during the summer are indefinite. Fuzzy bounds mean you can treat the quantity as unbounded over a limited range of the number line.
• Choose a normal (parametric) distribution because some theory supports the choice, the distribution matches the observed data well, and the distribution has a tail extending beyond the observed minimum and maximum. That information will be revealed in subsequent steps.
• Parameter estimates mean a first-order distribution.
• Univariate.
BUILDING STRONG®
Look at It
[Histogram: Daily Maximum Water Temperature. Horizontal axis: Water Temperature (Degrees Celsius). Vertical axis: Percent, 0 to 18]
BUILDING STRONG®
Look Again!
[Boxplot: Daily Maximum Water Temperature, axis from 27.8 to 29.6 degrees Celsius]
The boxplot confirms a slight skew to the
left in the sample data and in the
interquartile range.
BUILDING STRONG®
Theory
• The “system” that produces a daily maximum temperature is complex enough and random enough that we believe deviations about the mean are likely to be symmetrical.
BUILDING STRONG®
Statistics
Descriptive statistics
Count 100
Mean 28.8
Median 28.8
Sample standard deviation 0.3
Minimum 28.1
Maximum 29.5
Range 1.4
Standard error of the mean 0.027
95% confidence interval, lower 28.7
95% confidence interval, upper 28.9
Coefficient of variation 0.9
Daily Maximum Water Temperature
• Mean and median are approximately equal
• Coefficient of variation (0 to 100+ scale) is small.
• There are no outliers in this dataset.
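The reported statistics are linked by simple arithmetic, which makes a quick consistency check possible. The sketch below starts from the count, mean, and standard error in the table. Using 1.96 as the multiplier for the 95% interval is an assumption (a normal approximation); the slide does not say which multiplier was used.

```python
import math

# Values reported in the table above.
count = 100
mean = 28.8
standard_error = 0.027

# Standard error = SD / sqrt(n), so the unrounded SD can be recovered.
sd = standard_error * math.sqrt(count)

# Coefficient of variation on the 0 to 100+ (percent) scale.
cv_percent = 100 * sd / mean

# 95% confidence interval for the mean, assuming a multiplier of 1.96.
z = 1.96
ci_lower = mean - z * standard_error
ci_upper = mean + z * standard_error

print(f"SD = {sd:.2f}")
print(f"CV = {cv_percent:.1f}%")
print(f"95% CI = {ci_lower:.1f} to {ci_upper:.1f}")
```

The recovered SD of about 0.27 rounds to the 0.3 shown in the table, and it explains why the coefficient of variation is reported as 0.9 rather than the 1.0 that the rounded SD would give.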
BUILDING STRONG®
GOF
Normal Weibull Logistic ExtValue Pareto
Chi-Squared Test
Chi-Sq Statistic 7.14 8.02 11.98 17.26 121.89
P-Value 0.7122 0.6269 0.2864 0.0688 0
Cr. Value @ 0.750 6.7372 6.7372 6.7372 6.7372 6.7372
Cr. Value @ 0.500 9.3418 9.3418 9.3418 9.3418 9.3418
Cr. Value @ 0.250 12.5489 12.5489 12.5489 12.5489 12.5489
Cr. Value @ 0.150 14.5339 14.5339 14.5339 14.5339 14.5339
Cr. Value @ 0.100 15.9872 15.9872 15.9872 15.9872 15.9872
Cr. Value @ 0.050 18.307 18.307 18.307 18.307 18.307
Cr. Value @ 0.025 20.4832 20.4832 20.4832 20.4832 20.4832
Cr. Value @ 0.010 23.2093 23.2093 23.2093 23.2093 23.2093
Cr. Value @ 0.005 25.1882 25.1882 25.1882 25.1882 25.1882
Cr. Value @ 0.001 29.5883 29.5883 29.5883 29.5883 29.5883
Chi-Square Test Results
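Reading the table amounts to comparing each test statistic with the critical value at a chosen significance level. A minimal sketch using the numbers above (the function name `not_rejected` is hypothetical):

```python
# Chi-square statistics and critical values copied from the table above.
chi_square_statistics = {
    "Normal": 7.14,
    "Weibull": 8.02,
    "Logistic": 11.98,
    "ExtValue": 17.26,
    "Pareto": 121.89,
}
critical_values = {0.10: 15.9872, 0.05: 18.307, 0.01: 23.2093}


def not_rejected(statistics, critical_value):
    """Distributions whose statistic falls below the critical value."""
    return [name for name, stat in statistics.items() if stat < critical_value]


for alpha, critical in critical_values.items():
    print(f"alpha = {alpha}: not rejected -> "
          f"{not_rejected(chi_square_statistics, critical)}")

best = min(chi_square_statistics, key=chi_square_statistics.get)
print(f"smallest statistic: {best}")
```

At the 0.05 level only the Pareto is rejected, so the test narrows the field without selecting a winner. That is why it serves as one piece of evidence alongside theory, plots, and statistics.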
BUILDING STRONG®
Choice
• An expert opinion is not necessary.
• Sensitivity analysis will not be necessary.
• The normal distribution is continuous, parametric, consistent with my theory, successfully used in the past, and statistically consistent with my data. Therefore I will use it.
BUILDING STRONG®
Take Away Points
• Choosing the best distribution is where most new risk assessors feel least comfortable.
• Choice of distribution matters.
• Distributions come from data and expert opinion.
• Distribution fitting should never be the basis for distribution choice.

More Related Content

What's hot

Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Smarten Augmented Analytics
 
Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos
butest
 

What's hot (20)

Dispersion 2
Dispersion 2Dispersion 2
Dispersion 2
 
Research Method for Business chapter 12
Research Method for Business chapter 12Research Method for Business chapter 12
Research Method for Business chapter 12
 
What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
 
Introduction to the t Statistic
Introduction to the t StatisticIntroduction to the t Statistic
Introduction to the t Statistic
 
What is the Multinomial-Logistic Regression Classification Algorithm and How ...
What is the Multinomial-Logistic Regression Classification Algorithm and How ...What is the Multinomial-Logistic Regression Classification Algorithm and How ...
What is the Multinomial-Logistic Regression Classification Algorithm and How ...
 
What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?What Is Random Forest Classification And How Can It Help Your Business?
What Is Random Forest Classification And How Can It Help Your Business?
 
Measurement&scaling
Measurement&scalingMeasurement&scaling
Measurement&scaling
 
200 chapter 7 measurement :scaling by uma sekaran
200 chapter 7 measurement :scaling by uma sekaran 200 chapter 7 measurement :scaling by uma sekaran
200 chapter 7 measurement :scaling by uma sekaran
 
Decision tree
Decision treeDecision tree
Decision tree
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision tree
 
Malhotra08
Malhotra08Malhotra08
Malhotra08
 
Malhotra09
Malhotra09Malhotra09
Malhotra09
 
Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos
 
Back to the basics-Part2: Data exploration: representing and testing data pro...
Back to the basics-Part2: Data exploration: representing and testing data pro...Back to the basics-Part2: Data exploration: representing and testing data pro...
Back to the basics-Part2: Data exploration: representing and testing data pro...
 
CABT SHS Statistics & Probability - Mean and Variance of Sampling Distributio...
CABT SHS Statistics & Probability - Mean and Variance of Sampling Distributio...CABT SHS Statistics & Probability - Mean and Variance of Sampling Distributio...
CABT SHS Statistics & Probability - Mean and Variance of Sampling Distributio...
 
Measurement and scaling noncomparative scaling technique
Measurement and scaling noncomparative scaling techniqueMeasurement and scaling noncomparative scaling technique
Measurement and scaling noncomparative scaling technique
 
Decision Tree Analysis
Decision Tree AnalysisDecision Tree Analysis
Decision Tree Analysis
 
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
What is Isotonic Regression and How Can a Business Utilize it to Analyze Data?
 
Bj research session 9 analysing quantitative
Bj research session 9 analysing quantitativeBj research session 9 analysing quantitative
Bj research session 9 analysing quantitative
 

Similar to Choosing a Probability Distribution - Charles Yoe

statisticsintroductionofbusinessstats.ppt
statisticsintroductionofbusinessstats.pptstatisticsintroductionofbusinessstats.ppt
statisticsintroductionofbusinessstats.ppt
voore ajay
 
Lecture3 Modelling Decision Processes
Lecture3 Modelling Decision ProcessesLecture3 Modelling Decision Processes
Lecture3 Modelling Decision Processes
Kodok Ngorex
 
Introduction to RandomForests 2004
Introduction to RandomForests 2004Introduction to RandomForests 2004
Introduction to RandomForests 2004
Salford Systems
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
3.2 measures of variation
3.2 measures of variation3.2 measures of variation
3.2 measures of variation
leblance
 

Similar to Choosing a Probability Distribution - Charles Yoe (20)

statisticsintroductionofbusinessstats.ppt
statisticsintroductionofbusinessstats.pptstatisticsintroductionofbusinessstats.ppt
statisticsintroductionofbusinessstats.ppt
 
Lecture3 Modelling Decision Processes
Lecture3 Modelling Decision ProcessesLecture3 Modelling Decision Processes
Lecture3 Modelling Decision Processes
 
Introduction to Statistics2312.ppt
Introduction to Statistics2312.pptIntroduction to Statistics2312.ppt
Introduction to Statistics2312.ppt
 
Introduction to Statistics23122223.ppt
Introduction to Statistics23122223.pptIntroduction to Statistics23122223.ppt
Introduction to Statistics23122223.ppt
 
Course Module 9: (Some) Statistics For Managers Who Hate Statistics
Course Module 9: (Some) Statistics For Managers Who Hate StatisticsCourse Module 9: (Some) Statistics For Managers Who Hate Statistics
Course Module 9: (Some) Statistics For Managers Who Hate Statistics
 
Applied statistics part 5
Applied statistics part 5Applied statistics part 5
Applied statistics part 5
 
Introduction to RandomForests 2004
Introduction to RandomForests 2004Introduction to RandomForests 2004
Introduction to RandomForests 2004
 
Parametric vs Nonparametric Tests: When to use which
Parametric vs Nonparametric Tests: When to use whichParametric vs Nonparametric Tests: When to use which
Parametric vs Nonparametric Tests: When to use which
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
 
sample_size_Determination .pdf
sample_size_Determination .pdfsample_size_Determination .pdf
sample_size_Determination .pdf
 
Sample Size And Gpower Module
Sample Size And Gpower ModuleSample Size And Gpower Module
Sample Size And Gpower Module
 
Mini datathon
Mini datathonMini datathon
Mini datathon
 
Mixed Effects Models - Random Intercepts
Mixed Effects Models - Random InterceptsMixed Effects Models - Random Intercepts
Mixed Effects Models - Random Intercepts
 
Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event
Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report eventDylan Wiliam's presentation at Ofqual's Chief Regulator's Report event
Dylan Wiliam's presentation at Ofqual's Chief Regulator's Report event
 
3.2 measures of variation
3.2 measures of variation3.2 measures of variation
3.2 measures of variation
 
Machine Learning - Decision Trees
Machine Learning - Decision TreesMachine Learning - Decision Trees
Machine Learning - Decision Trees
 
Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data Analysis
 
"Portfolio Optimisation When You Don’t Know the Future (or the Past)" by Rob...
"Portfolio Optimisation When You Don’t Know the Future (or the Past)" by Rob..."Portfolio Optimisation When You Don’t Know the Future (or the Past)" by Rob...
"Portfolio Optimisation When You Don’t Know the Future (or the Past)" by Rob...
 
Machine Learning (Decisoion Trees)
Machine Learning (Decisoion Trees)Machine Learning (Decisoion Trees)
Machine Learning (Decisoion Trees)
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - Bengaluru
 

Choosing a Probability Distribution - Charles Yoe

  • 1. US Army Corps of Engineers BUILDING STRONG® Choosing a Probability Distribution Charles Yoe Professor of Economics, Notre Dame of Maryland University Huntington District March, 2015
  • 2. BUILDING STRONG® Learning Objectives  At the end of this session participants will be able to:  Systematically identify the best distribution to use for a single uncertain variable
  • 3. BUILDING STRONG® Quantitative risk assessment requires you to use probability
  • 4. BUILDING STRONG® Using Probability  Sometimes you will estimate the probability of an event  Sometimes you will use distributions to ►Describe data ►Model variability (range of consequences) ►Represent your uncertainty Which distribution should I use?
  • 5. BUILDING STRONG® Basic Distinctions  Constant  Variables • Some things vary predictably • Some things vary unpredictably  Random variables • It can be something known but not known by us
  • 6. BUILDING STRONG® A probability distribution is simply a list or picture of all possible outcomes and their corresponding probabilities
  • 7. BUILDING STRONG® When the outcomes are continuous, like lockage times, then the notion of probability takes on some subtleties.
  • 8. BUILDING STRONG® Probability Mass Function when sample space consists of discrete outcomes, number of tainter gate failures this year, we can talk about the probability of each outcome.
  • 9. BUILDING STRONG® Probability Density Function For continuous outcome spaces, we can “discretize” the space into a finite set of mutually exclusive and exhaustive “bins”. We can divide lockage times into intervals: 0-50, >50-100, >100-103.7 and so on.
  • 11. BUILDING STRONG® This measures the x variable value This measures p(x), for a discrete variable this is the probability of a corresponding x
  • 12. BUILDING STRONG® This measures the x variable value This measures p(x), for a continuous variable this is density and not the probability of x
  • 13. BUILDING STRONG® Notice the vertical axes in two normal distributions with a mean of 100. When the SD is small (right) many more values must get ‘packed’ into a smaller interval, making the data much more dense, 4 vs. .4. Thus, the vertical axis for a continuous distribution has no meaning as a probability. It measures the density of the data.
  • 14. BUILDING STRONG® Applying the Principles  How many probability distributions can we name without help?
  • 15. BUILDING STRONG® Checklist for Choosing a Distributions From Some Data 1. Can you use your data? 2. Understand your variable a) Source of data b) Continuous/discrete c) Bounded/unbounded d) Meaningful parameters a) Do you know them? (1st or 2nd order) e) Univariate/multivariate 3. Look at your data— plot it 4. Use theory 5. Calculate statistics 6. Use previous experience 7. Distribution fitting 8. Expert opinion 9. Sensitivity analysis
  • 16. BUILDING STRONG® First!  Do you have data germane to your population?  If so, do you need a distribution or can you just use your data?
  • 17. BUILDING STRONG® One problem could be bounding your data, if you do not have the true minimum and maximum values. Any dataset can be displayed as a cumulative distribution function or a general density function
  • 19. BUILDING STRONG®19 Applying the Principles  Go to @RISK  What are advantages of PDF?  CDF?
  • 20. BUILDING STRONG® Fitting Empirical Distribution to Data  Rank data x(i) in ascending order  Calculate the percentile for each value  Use data and percentiles to create cumulative distribution function Data Cumulative Probability Index Value F(x) = i/19 0 0 0 1 0.9 0.053 2 3.6 0.105 3 5.0 0.158 4 6.0 0.211 5 11.7 0.263 6 16.2 0.316 7 16.5 0.368 8 22.2 0.421 9 22.7 0.474 10 23.2 0.526 11 24.5 0.579 12 24.9 0.632 13 25.8 0.684 14 33.3 0.737 15 33.4 0.789 16 34.7 0.842 17 40.2 0.895 18 44.2 0.947 19 60.0 1
  • 21. BUILDING STRONG® For Thursday Everybody Buy  M&M’s at UC Davis Store  1.69 oz bag of Milk Chocolate M&M’s
  • 22. BUILDING STRONG® When You Can’t Use Your Data  Given wide variety of distributions it is not always easy to select the most appropriate one
  • 23. BUILDING STRONG® Does Distribution Matter?  Wrong assumption =>  Incorrect results=>  Poor decisions=>  Undesirable outcomes
  • 24. BUILDING STRONG® Understand Your Data ►Experiments ►Observation ►Surveys ►Computer databases ►Literature searches ►Simulations ►Test case Where did my data come from?
  • 25. BUILDING STRONG® ►Discrete variables take one of a set of identifiable values, you can calculate its probability of occurrence ►Continuous distributions- a variable that can take any value within a defined range, you can’t calculate the probability of a single value Is my variable discrete or continuous?
  • 26. BUILDING STRONG® Barges in a tow Houses in floodplain People at a meeting Results of a diagnostic test Casualties per year Relocations and acquisitions Average number of barges per tow Weight of an adult striped bass Sensitivity or specificity of a diagnostic test Transit time Expected annual damages Duration of a storm Shoreline eroded Sediment loads
  • 28. BUILDING STRONG® What Values Are Possible? ►Bounded-value confined to lie between two determined values ►Unbounded-value theoretically extends from minus infinity to plus infinity ►Partially bounded- constrained at one end (truncated distributions) Is my variable bounded or unbounded?
  • 29. BUILDING STRONG® Continuous Distribution Examples  Unbounded ► Normal ► t ► Logistic  Left Bounded ► Chi-square ► Exponential ► Gamma ► Lognormal ► Weibull  Bounded ► Beta ► Cumulative ► General/histogram ► Pert ► Uniform ► Triangle
  • 30. BUILDING STRONG® Discrete Distribution Examples  Unbounded ► None  Left Bounded ► Poisson ► Negative binomial ► Geometric  Bounded ► Binomial ► Hypergeometric ► Discrete ► Discrete Uniform
  • 31. BUILDING STRONG® Are There Parameters ►Parametric--shape is determined by mathematics of a conceptual probability model ►Non-parametric—empirical distributions whose mathematics is defined by the shape required Does my variable have meaningful parameters?
  • 32. BUILDING STRONG® Parametric and Non-Parametric  Normal  Lognormal  Exponential  Poisson  Binomial  Gamma  Uniform  Pert  Triangular  Cumulative
  • 33. BUILDING STRONG® Do You Know the Parameters?  1st order distribution- known parameters ►Risknormal(100,10)  2nd order distribution- uncertain parameters ► Risknormal(risktriang(90,100,10 3),riskuniform(8,11)) Do I know the parameters?
  • 34. BUILDING STRONG® Choose a parametric distribution under these circumstances  Theory supports choice  Distribution proven accurate for modelling your specific variable  Distribution matches observed data well  Need distribution with tail extending beyond the observed minimum or maximum
  • 35. BUILDING STRONG® Choose a nonparametric distribution under these circumstances  Theory is lacking  There is no commonly used model  Data are severely limited  Knowledge is limited to general beliefs and some evidence
  • 36. BUILDING STRONG® Is It Dependent on Other Variables? ►Univariate--not probabilistically linked to any other variable in the model ►Multivariate--are probabilistically linked in some way ►Engineering relationships are often multivariate Do the values of my variable depend on the values of other variables?
  • 37. BUILDING STRONG®
    Distribution | Type | No Bounds | Bounded Left & Right | Left Bound Only | Category | Shape | Empirical Distribution
    Beta | Continuous | No | Yes | No | Non-parametric | Shape shifter | No
    Binomial | Discrete | No | Yes | No | Parametric | Some Flexibility | No
    Chi Squared | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Cumulative ascending | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
    Cumulative descending | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
    Discrete | Discrete | No | Yes | No | Non-parametric | Shape shifter | Yes
    Discrete uniform | Discrete | No | Yes | No | Non-parametric | Basic Shape | No
    Erlang | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Error | Continuous | Yes | No | No | Parametric | Some Flexibility | No
    Exponential | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Extreme value | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Gamma | Continuous | No | No | Yes | Parametric | Shape shifter | No
    General | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
    Geometric | Discrete | No | No | Yes | Parametric | Some Flexibility | No
    Histogram | Continuous | No | Yes | No | Non-parametric | Shape shifter | Yes
    Hypergeometric | Discrete | No | Yes | No | Parametric | Some Flexibility | No
    Integer uniform | Discrete | No | Yes | No | Non-parametric | Basic Shape | No
    Inverse Gaussian | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Logarithmic | Discrete | No | No | Yes | Parametric | Some Flexibility | No
    Logistic | Continuous | Yes | No | No | Parametric | Basic Shape | No
    Lognormal | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Lognormal2 | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Negative Binomial | Discrete | No | No | Yes | Parametric | Some Flexibility | No
    Normal | Continuous | Yes | No | No | Parametric | Basic Shape | No
    Pareto | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Pareto2 | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Pearson V | Continuous | No | No | Yes | Parametric | Some Flexibility | No
    Pearson VI | Continuous | No | No | Yes | Parametric | Some Flexibility | No
    PERT | Continuous | No | Yes | No | Non-parametric | Shape shifter | No
    Poisson | Discrete | No | No | Yes | Parametric | Basic Shape | No
    Rayleigh | Continuous | No | No | Yes | Parametric | Basic Shape | No
    Student | Continuous | Yes | No | No | Parametric | Basic Shape | No
    Triangle (various) | Continuous | No | Yes | No | Non-parametric | Some Flexibility | No
    Weibull | Continuous | No | No | Yes | Parametric | Shape shifter | No
    How can I make this table legible?
  • 38. BUILDING STRONG® Always plot your data.  Don’t just calculate the mean and SD and assume it’s normal
  • 39. BUILDING STRONG® Look for distinctive shapes and features of your data. ►Single peaks ►Symmetry ►Positive skew ►Negative values Gamma, Weibull, and beta are useful and flexible functions.
  • 40. BUILDING STRONG® ►Low coefficient of variation & mean = median => Normal ►Positive skew & mean = standard deviation => Exponential ►Consider outliers Try calculating some statistics to get a feel.
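The rules of thumb above can be checked with a few summary statistics. The following is a minimal sketch using only the Python standard library; the helper `summarize` and the two synthetic samples are illustrative assumptions, not data from the presentation.

```python
import random
import statistics

def summarize(sample):
    """Return the summary statistics used in the rules of thumb."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)            # sample standard deviation
    # Moment-based skewness; a positive value means a longer right tail.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    return {
        "mean": mean,
        "median": statistics.median(sample),
        "sd": sd,
        "cv": sd / mean,                     # only meaningful when the mean is not near zero
        "skew": m3 / m2 ** 1.5,
    }

rng = random.Random(1)                       # fixed seed for repeatability
# Synthetic samples: one roughly normal, one exponential with mean 5.
normal_like = summarize([rng.gauss(28.8, 0.3) for _ in range(5000)])
exponential_like = summarize([rng.expovariate(1 / 5.0) for _ in range(5000)])

for name, stats in (("normal-like", normal_like), ("exponential-like", exponential_like)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

For the normal-like sample the mean and median nearly coincide and the coefficient of variation is small; for the exponential-like sample the mean and standard deviation are close and the skew is clearly positive. These are hints, not proof of a distribution.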
  • 41. BUILDING STRONG®  Formal theory, e.g., CLT  Theoretical knowledge of the variable ►Behavior or math  Informal theory ►Sums normal, products lognormal ►Study specific ►Your best documented thoughts on subject Theory is your most compelling reason for choosing a distribution.
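The informal rule "sums normal, products lognormal" can be illustrated with a small simulation. This sketch assumes twelve independent factors drawn from a uniform distribution on 0.5 to 1.5, chosen only for illustration; `skewness` is a hypothetical helper.

```python
import math
import random

def skewness(values):
    # Moment-based skewness: near 0 for symmetric data, positive for a long right tail.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(7)                 # fixed seed for repeatability
trials, factors = 5000, 12
sums, products = [], []
for _ in range(trials):
    draws = [rng.uniform(0.5, 1.5) for _ in range(factors)]
    sums.append(sum(draws))            # additive process
    products.append(math.prod(draws))  # multiplicative process

skew_sums = skewness(sums)
skew_products = skewness(products)
skew_log_products = skewness([math.log(p) for p in products])

print(f"skew of sums:            {skew_sums:.2f}")
print(f"skew of products:        {skew_products:.2f}")
print(f"skew of log(products):   {skew_log_products:.2f}")
```

The sums come out close to symmetric, the products are strongly right-skewed, and taking logarithms of the products removes most of that skew, which is the behavior a lognormal model describes.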
  • 42. BUILDING STRONG® Outliers [Figure: plot of House Value, axis from 0 to 600] Extreme observations can drastically influence a probability model What are these points telling you? ► What about your world-view is inconsistent with this result? ► Should you reconsider your perspective? ► What possible explanations have you not yet considered?
  • 43. BUILDING STRONG® If observation is an error, remove it. Your explanation must be correct, not just plausible. If you must keep it and can’t explain it, just live with the results.
  • 44. BUILDING STRONG® Previous Experience Have I dealt with this before? What did other risk assessments use? What does the Literature reveal?
  • 45. BUILDING STRONG® Goodness of Fit  Provides statistical evidence to test the hypothesis that your data could have come from a specific distribution  H0: these data come from an “x” distribution  A small test statistic and a large p-value mean you fail to reject H0  It is another piece of evidence, not a determining factor Geek slide warning
  • 46. BUILDING STRONG® Hypothesis The data (blue) come from the population (red)
  • 47. BUILDING STRONG® GOF Tests  Akaike Information Criterion (AIC)  Bayesian Information Criterion (BIC)  Chi-Square Test  Kolmogorov-Smirnov Test  Anderson-Darling Test Geek slide warning Better fit for means than tails Better fit at extreme tails of distribution Better fit for means than tails Use AIC or BIC unless you have reason not to
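To show how AIC and BIC rank candidate distributions, here is a minimal sketch that fits a normal and an exponential model by maximum likelihood and compares the two criteria (lower is better). The data are synthetic and the function names are hypothetical; fitting software may use different estimators or conventions.

```python
import math
import random

def aic(log_likelihood, k):
    # k is the number of fitted parameters.
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # n is the number of observations.
    return k * math.log(n) - 2 * log_likelihood

def normal_loglik(data):
    # Log-likelihood at the maximum likelihood estimates of mean and variance.
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def exponential_loglik(data):
    # Log-likelihood at the maximum likelihood rate (1 / sample mean); data must be positive.
    n = len(data)
    mean = sum(data) / n
    return -n * (math.log(mean) + 1)

rng = random.Random(3)                                 # fixed seed for repeatability
data = [rng.gauss(28.8, 0.3) for _ in range(100)]      # synthetic, temperature-like values
n = len(data)

scores = {
    "normal": (aic(normal_loglik(data), 2), bic(normal_loglik(data), 2, n)),
    "exponential": (aic(exponential_loglik(data), 1), bic(exponential_loglik(data), 1, n)),
}
for name, (a, b) in scores.items():
    print(f"{name:12s} AIC={a:9.1f}  BIC={b:9.1f}")
```

Both criteria penalize extra parameters, so the normal model wins here only because its much higher likelihood outweighs the cost of its second parameter.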
  • 48. BUILDING STRONG® ►Data never collected ►Data too expensive or impossible ►Past data irrelevant ►Opinion needed to fill holes in sparse data ►New area of inquiry, unique situation that never existed Why use an expert?
  • 49. BUILDING STRONG®  The distribution itself ►E.g. population is normal  Parameters of the distribution ►E.g. mean is x and standard deviation is y ►E.g. 5th, 50th, 95th percentile values What might an expert estimate?
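When an expert supplies 5th, 50th and 95th percentile values, they can be converted into distribution parameters. The sketch below assumes a normal distribution is appropriate and that the percentiles are roughly symmetric about the median; the elicited values 28.3, 28.8 and 29.3 are hypothetical.

```python
from statistics import NormalDist

def normal_from_percentiles(p5, p50, p95):
    """Build a normal distribution from elicited 5th, 50th and 95th percentiles."""
    z95 = NormalDist().inv_cdf(0.95)     # about 1.645 standard deviations
    mean = p50                           # for a normal, the median equals the mean
    sd = (p95 - p5) / (2 * z95)          # the 5th-to-95th span covers 2 * z95 standard deviations
    return NormalDist(mean, sd)

# Hypothetical expert estimates, in degrees Celsius.
elicited = normal_from_percentiles(28.3, 28.8, 29.3)
print(f"mean={elicited.mean:.2f}, sd={elicited.stdev:.3f}")
print(f"implied 5th percentile:  {elicited.inv_cdf(0.05):.2f}")
print(f"implied 95th percentile: {elicited.inv_cdf(0.95):.2f}")
```

If the expert's median sits well off-center between the 5th and 95th percentiles, that asymmetry is a signal that a skewed distribution may describe the judgment better than a normal.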
  • 50. BUILDING STRONG®  Unsure which distribution is best?  Try several ►If no difference you are free to use any one ►Significant differences mean doing more work A final strategy is to test the sensitivity of your results to the uncertain probability model.
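A sensitivity check of this kind can be sketched by running the same model with several candidate input distributions and comparing the results. Everything in this example is an assumption made for illustration: the `output` function is a hypothetical consequence model, and the three candidate distributions share a center of 100 but differ in shape.

```python
import random
import statistics

def output(x):
    # Hypothetical consequence model: loss grows with the square of any exceedance above 110.
    return max(0.0, x - 110.0) ** 2

# Candidate probability models for the same uncertain input.
candidates = {
    "normal": lambda rng: rng.gauss(100, 10),
    "triangular": lambda rng: rng.triangular(70, 130, 100),   # (low, high, mode)
    "uniform": lambda rng: rng.uniform(70, 130),
}

rng = random.Random(2015)       # fixed seed for repeatability
n = 50000
results = {
    name: statistics.fmean(output(draw(rng)) for _ in range(n))
    for name, draw in candidates.items()
}
for name, mean_loss in results.items():
    print(f"{name:10s} mean loss = {mean_loss:6.2f}")
```

In this sketch the mean loss differs several-fold across the candidates because the output depends on the upper tail, so the choice of distribution would need more work; if the results had been close, any of the candidates would have served.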
  • 51. BUILDING STRONG® Applying the Principles  Do exercise
  • 52. BUILDING STRONG® Example 28.1 28.3 29.0 28.8 29.1 28.8 28.9 28.7 28.5 28.8 28.3 28.6 29.1 28.9 28.7 29.3 28.9 29.4 29.4 28.8 29.5 28.5 29.0 28.6 28.7 29.1 29.1 28.6 28.6 28.8 29.2 29.2 28.4 29.5 28.9 28.9 28.9 28.7 28.3 28.7 28.9 28.4 28.7 29.2 29.2 29.7 29.1 29.2 28.5 28.9 28.6 28.7 28.5 28.5 29.1 28.8 28.2 28.6 28.9 29.5 28.9 29.0 29.1 28.8 29.2 28.7 29.3 28.7 28.8 29.1 29.4 29.0 29.2 28.8 28.5 29.0 29.3 29.1 29.0 28.6 28.6 29.5 29.5 28.8 29.6 29.0 29.5 28.7 28.9 28.2 29.2 29.0 28.7 28.9 28.6 28.5 29.6 29.6 28.3 28.7 29.0 29.0 29.3 28.5 28.9 28.4 28.7 28.9 28.9 29.0 Summer Water Temperature in Degrees Celsius
  • 53. BUILDING STRONG® Know Your Data  Data are continuous.  Practical minimum and maximum temperatures for these coastal waters during the summer are indefinite. Fuzzy bounds mean you can treat the quantity as unbounded over a limited range of the number line.  Choose a normal (parametric) distribution because some theory supports the choice, the distribution matches the observed data well, and the distribution has a tail extending beyond the observed minimum and maximum. That information will be revealed in subsequent steps.  Parameter estimates mean a first order distribution.  Univariate.
  • 54. BUILDING STRONG® Look at It [Figure: histogram of Daily Maximum Water Temperature; x-axis: Water Temperature (Degrees Celsius), y-axis: Percent]
  • 55. BUILDING STRONG® Look Again! [Figure: Water Temperature Boxplot of Daily Maximum Water Temperature, axis from 27.8 to 29.6] The boxplot confirms a slight skew to the left in the sample data and in the interquartile range.
  • 56. BUILDING STRONG® Theory  The “system” that produces a daily maximum temperature is complex enough and random enough that we believe deviations about the mean are likely to be symmetrical.
  • 57. BUILDING STRONG® Statistics
    Descriptive statistics: Daily Maximum Water Temperature
    Count | 100
    Mean | 28.8
    Median | 28.8
    Sample standard deviation | 0.3
    Minimum | 28.1
    Maximum | 29.5
    Range | 1.4
    Standard error of the mean | 0.027
    Confidence interval 95% lower | 28.7
    Confidence interval 95% upper | 28.9
    Coefficient of variation | 0.9
     Mean and median are approximately equal  Coefficient of variation (0 to 100+ scale) is small.  There are no outliers in this dataset.
  • 58. BUILDING STRONG® GOF
    Chi-Square Test Results
    Chi-Squared Test | Normal | Weibull | Logistic | ExtValue | Pareto
    Chi-Sq Statistic | 7.14 | 8.02 | 11.98 | 17.26 | 121.89
    P-Value | 0.7122 | 0.6269 | 0.2864 | 0.0688 | 0
    Cr. Value @ 0.750 | 6.7372 | 6.7372 | 6.7372 | 6.7372 | 6.7372
    Cr. Value @ 0.500 | 9.3418 | 9.3418 | 9.3418 | 9.3418 | 9.3418
    Cr. Value @ 0.250 | 12.5489 | 12.5489 | 12.5489 | 12.5489 | 12.5489
    Cr. Value @ 0.150 | 14.5339 | 14.5339 | 14.5339 | 14.5339 | 14.5339
    Cr. Value @ 0.100 | 15.9872 | 15.9872 | 15.9872 | 15.9872 | 15.9872
    Cr. Value @ 0.050 | 18.307 | 18.307 | 18.307 | 18.307 | 18.307
    Cr. Value @ 0.025 | 20.4832 | 20.4832 | 20.4832 | 20.4832 | 20.4832
    Cr. Value @ 0.010 | 23.2093 | 23.2093 | 23.2093 | 23.2093 | 23.2093
    Cr. Value @ 0.005 | 25.1882 | 25.1882 | 25.1882 | 25.1882 | 25.1882
    Cr. Value @ 0.001 | 29.5883 | 29.5883 | 29.5883 | 29.5883 | 29.5883
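The p-values in the chi-square results can be reproduced from the test statistics. The sketch below assumes 10 degrees of freedom, which is an inference from the listed critical values (18.307 at the 0.050 level corresponds to 10 degrees of freedom) rather than something the slide states. It uses the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom; `chi2_sf_even_df` is a hypothetical helper name.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail is exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive, even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term              # add (x/2)^k / k!
        term *= half / (k + 1)     # advance to the next term of the series
    return math.exp(-half) * total

DF = 10  # assumed from the critical values on the slide
statistics_by_fit = {"Normal": 7.14, "Weibull": 8.02, "Logistic": 11.98, "ExtValue": 17.26}
p_values = {name: chi2_sf_even_df(stat, DF) for name, stat in statistics_by_fit.items()}
for name, p in p_values.items():
    print(f"{name:9s} chi-sq={statistics_by_fit[name]:6.2f}  p={p:.4f}")
```

A large p-value, as for the normal fit, means the data give no reason to reject that distribution; it does not by itself establish that the distribution is the right one.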
  • 59. BUILDING STRONG® Choice  An expert opinion is not necessary.  Sensitivity analysis will not be necessary.  The normal distribution is continuous, parametric, consistent with my theory, successfully used in the past, and statistically consistent with my data. Therefore I will use it.
  • 60. BUILDING STRONG® Take Away Points  Choosing the best distribution is where most new risk assessors feel least comfortable.  Choice of distribution matters.  Distributions come from data and expert opinion.  Distribution fitting should never be the basis for distribution choice.