Frequentist inference only seems easy By John Mount
This is part of the Alpine ML Talk Series.
The talk is called “Frequentist inference only seems easy” and is about the theory of simple statistical inference (based on material from this article: http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/ ). The talk includes some simple dice games (I bring dice!) that really break the rote methods commonly taught as statistics. This is actually a good thing, as it gives you time and permission to work out how common statistical methods are properly derived from basic principles. This takes a little math (which I develop in the talk), but it changes some statistics from “do this” to “here is why you calculate like this.” It should appeal to people interested in the statistical and machine learning parts of data science.
Frequentist estimation only seems easy
First example problem: estimating the success rate of coin flips.
Second example problem: estimating the success rate of a dice game.
Interspersed in both: an entomologist’s view of lots of heavy math.
Image from “HOW TO PIN AND LABEL ADULT INSECTS”
Bambara, Blinn, http://www.ces.ncsu.edu/depts/ent/notes/
This talk is going to alternate between simple probability games (like rolling dice) and the detailed calculations needed to bring the reasoning forward. If you come away with two points from this talk, remember: classic frequentist statistics is not as cut and dried as teachers claim (so it is okay to ask questions), and Bayesian statistics is not nearly as complicated as people make it out to be.
The point of this talk
Statistics is a polished field where many of the foundations are no longer commonly discussed.
A lot of the “math anxiety” felt in learning statistics is from uncertainty about these foundations, and how they actually lead to common statistical procedures.
We are going to discuss common simple statistical goals (correct models, unbiasedness, low error) and how they lead to common simple statistical procedures.
The surprises (at least for me) are:
There is more than one way to do things.
The calculations needed to justify how even simple procedures
are derived from the goals are in fact pretty involved.
A lot of the pain of learning is being told there is only “one way” (when there is more than one) and that a hard step (linking goals to procedures) is easy (when in fact it is hard). Statistics would be easier to teach if those two things were true, but they are not. However, not addressing these issues makes learning statistics harder than it has to be. We are going to spend some time on what are appropriate statistical goals, and how they lead to common statistical procedures (instead of claiming everything is obvious). You won’t be expected to invent the math, but you need to accept that it is in fact hard to justify common statistical procedures without somebody having already done the math.
And I’ll be honest: I am a math-for-math’s-sake guy.
What you will get from this
Simple puzzles that present problems for the common rules of estimating rates.
Good for countering somebody who says “everything is easy and you just
don’t get it.”
Examples that expose strong consequences of the seemingly subtle differences
in common statistical estimation methods.
Makes understanding seemingly esoteric distinctions like Bayesianism and
frequentism much easier.
A taste of some of the really neat math used to establish common statistics.
A revival of Wald game-theoretic style inference (as described in Savage “The
Foundations of Statistics”).
You will get to roll the die, and we won’t make you do the heavy math. Aside: we have been telling people that one of the things that makes data science easy is that large data sets allow you to avoid some of the hard math in small sample size problems. Here we work through some of the math. In practice you do get small sample size issues even in large data sets, due to heavy-tail-like phenomena and when you introduce conditioning and segmentation (themselves typical modeling steps).
First example: coin flip game
Why do we even care?
The coin problem is a stand-in for something that is probably important to us: such as estimating the probability of a sale given features and past experience: P[ sale | features, evidence ].
Being able to efficiently form good estimates that combine domain knowledge, current features and past data is the ultimate goal of data science.
The coin problem
You are watching flips of a coin and want to estimate the probability
p that the coin comes up heads.
For example: "T" "T" "H" "T" "H" "T" "H" "H" "T" "T"
Easy to apply!
Sufficient statistic: 4 heads, 6 tails
Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4
Done. Thanks for your time.
# R code: simulate 10 flips of a fair coin (1 = heads, 0 = tails)
sample <- rbinom(10, 1, 0.5)
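Concretely, the “easy” estimate is just the observed rate. A minimal sketch (in Python rather than the slide’s R, using the flip sequence shown above):

```python
# Frequentist estimate of p from the observed flips on the slide.
flips = ["T", "T", "H", "T", "H", "T", "H", "H", "T", "T"]

heads = flips.count("H")  # sufficient statistic: 4 heads
tails = flips.count("T")  # 6 tails

p_hat = heads / (heads + tails)
print(p_hat)  # 0.4
```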
Wait, how did we know to do that?
Why is it obvious h/(h+t) is the best estimate of the unknown true
value of p?
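One way to answer, sketched here for reference: the usual maximum-likelihood derivation picks the p that makes the observed data most probable.

```latex
% Likelihood of h heads and t tails in independent flips (the binomial
% coefficient does not depend on p, so it can be ignored when maximizing):
L(p) \propto p^{h}\,(1-p)^{t}

% Set the derivative of the log-likelihood to zero:
\frac{d}{dp}\bigl[h\log p + t\log(1-p)\bigr] = \frac{h}{p} - \frac{t}{1-p} = 0

% Solve: h(1-p) = t\,p, giving the familiar estimate
\hat{p} = \frac{h}{h+t}
```

But maximum likelihood is one criterion among several, which is exactly the point in question.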
Fundamental problem: a mid-range probability prediction (say a number in the range 1/6 to 5/6) is not falsifiable by a single experiment. So: how do we know such statements actually have empirical content? The usual answers are performance on long sequences (frequentist), appeals to axioms of probability (essentially additivity of disjoint events), and subjective interpretations. Each view has some assumptions and trade-offs.
The standard easy estimate comes from frequentism

The standard answer (this example from http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair):

Estimator of true probability
“The best estimator for the actual value is the estimator p ~ h/(h + t).
This estimator has a margin of error (E) at a particular confidence level.
Using this approach, to decide the number of times the coin should be tossed, two parameters are required:
1. The confidence level, which is denoted by the confidence interval (Z)
2. The maximum (acceptable) error (E)
The confidence level is denoted by Z and is given by the Z-value of a standard normal distribution.”

[Plot of the probability density f(x | H = 7, T = 3) = 1320 x^7 (1 - x)^3 with x ranging from 0 to 1.]

The excerpt also gives a Bayesian analysis for comparison: “… is small when compared with the alternative hypothesis (a biased coin). However, it is not small enough to cause us to believe that the coin has a significant bias. Notice that this probability is slightly higher than our presupposition of the probability that the coin was fair corresponding to the uniform prior distribution, which was 10%. Using a prior distribution that reflects our prior knowledge of what a coin is and how it acts, the posterior distribution would not favor the hypothesis of bias. However the number of trials in this example (10 tosses) is very small, and with more trials the choice of prior distribution would be somewhat less relevant.
Note that, with the uniform prior, the posterior probability distribution f(r | H = 7, T = 3) achieves its peak at r = h/(h + t) = 0.7; this value is called the maximum a posteriori (MAP) estimate of r. Also with the uniform prior, the expected value of r under the posterior distribution is (h + 1)/(h + t + 2) = 2/3.”

The answer is correct and simple, but not good (as it lacks context, assumptions, goals, motivation and explanation).

Stumper: without an appeal to authority, how do we know to use the estimate heads/(heads + tails)? What problem is such an estimate solving (what criterion is it optimizing)?

Notation is a bit different: here tau is the unknown true value and p is the estimate. Throughout this talk by “coin” we mean an abstract device that always returns one of two states. Gelman and Nolan have an interesting article “You Can Load a Die, But You Can’t Bias a Coin” http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf about how hard it would be to bias an actual coin that you allow somebody else to flip (and how useless articles testing the fairness of the new Euro were).
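The Wikipedia figures are easy to verify. A small Python check with exact arithmetic, using only the Beta-posterior facts quoted above (uniform prior, h = 7, t = 3, so the posterior is a Beta(h+1, t+1) density):

```python
from fractions import Fraction
from math import factorial

h, t = 7, 3

# Normalizing constant of the Beta(h+1, t+1) posterior density:
# 1/B(h+1, t+1) = (h+t+1)! / (h! * t!)
norm = Fraction(factorial(h + t + 1), factorial(h) * factorial(t))
print(norm)       # 1320, matching f(x | H=7, T=3) = 1320 x^7 (1-x)^3

# MAP estimate: the mode of Beta(h+1, t+1) is h/(h+t)
map_est = Fraction(h, h + t)
print(map_est)    # 7/10

# Posterior mean of Beta(h+1, t+1): (h+1)/(h+t+2)
post_mean = Fraction(h + 1, h + t + 2)
print(post_mean)  # 2/3
```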
Also, there are other common estimates:
A priori belief: p ~ 0.5 regardless of evidence.
Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails+1) = 0.4090909
Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) = 0.4166667
Game theory minimax estimates (more on this later in this talk).
The classic frequentist estimate is not the only acceptable estimate.
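For the data above (4 heads, 6 tails) the competing formulas give noticeably different answers; a quick Python check with exact rationals:

```python
from fractions import Fraction

heads, tails = 4, 6  # sufficient statistic from the earlier flip sequence

frequentist = Fraction(heads, heads + tails)                 # h/(h+t)
jeffreys    = (heads + Fraction(1, 2)) / (heads + tails + 1) # (h+1/2)/(n+1)
laplace     = Fraction(heads + 1, heads + tails + 2)         # (h+1)/(n+2)

print(float(frequentist))  # 0.4
print(float(jeffreys))     # 0.4090909...
print(float(laplace))      # 0.4166666...
```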
Each of these has its merits. A priori belief has the least sampling noise (as it ignores the data). Bayesian with Jeffreys prior very roughly tries to maximize the amount of information captured in the first observation. Laplace smoothing minimizes expected square error under a uniform prior.
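That last claim can be checked exactly for a small experiment. A Python sketch (n = 10 is an arbitrary illustrative choice) computing each estimator's Bayes risk under a uniform prior, using the exact integral ∫₀¹ p^a (1-p)^b dp = a! b! / (a+b+1)!:

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(a, b):
    # exact value of the integral of p^a (1-p)^b over [0, 1]
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def bayes_risk(n, estimator):
    # E[(estimator(h) - p)^2] with p uniform on [0, 1] and h ~ Binomial(n, p),
    # expanding (e - p)^2 = e^2 - 2 e p + p^2 and integrating term by term
    risk = Fraction(0)
    for h in range(n + 1):
        e = estimator(h)
        c = comb(n, h)
        risk += c * (e * e * beta_int(h, n - h)
                     - 2 * e * beta_int(h + 1, n - h)
                     + beta_int(h + 2, n - h))
    return risk

n = 10
freq_risk = bayes_risk(n, lambda h: Fraction(h, n))          # h/n
laplace_risk = bayes_risk(n, lambda h: Fraction(h + 1, n + 2))  # (h+1)/(n+2)

print(laplace_risk < freq_risk)  # True
```

The closed forms are 1/(6n) for h/n and 1/(6(n+2)) for the Laplace-smoothed estimate, so smoothing wins under this criterion at every n.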
Each different estimate has its
own characteristic justification
From “The Cartoon Guide to Statistics”
Gonick and Smith.
If all of the estimates were “fully compatible” with each other then they would all be identical. Which they clearly are not. Notice we are discussing differences in estimates here, not differences in significances or hypothesis tests. Also, Bayesian priors are not always subjective beliefs (Wald in particular used an operational definition).
The standard story
There are 1 to 2 ways to do statistics: frequentism and maybe Bayesianism.
In frequentist estimation the unknown quantity to be estimated is fixed at a single value and the experiment is considered a repeatable event (with different possible measurements on each repetition).
All probabilities are over possible repetitions of the experiment, with the quantity to be estimated held fixed.
In Bayesian estimation the unknown quantity to be estimated is assumed to have a non-trivial distribution and the experimental results are considered fixed.
All probabilities are over possible values of the quantity to be estimated. Priors talk about the assumed distribution before measurement; posteriors talk about the distribution conditioned on the measurements.
There are other differences: such as a preference for point-wise estimates versus full descriptions of the distribution. And these are not the only possible models.
Our coin example again
I flip a coin a single time and it comes up heads: what is my best estimate of the probability the coin comes up heads in repeated flips?
“Classic”/naive probability: 0.5 (independent of observations/evidence)
Frequentist estimate: h/(h+t) = 1.0
Bayesian (Jeffreys prior): 0.75
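A quick Python check of the single-flip case (h = 1, t = 0), including the frequentist rule for contrast:

```python
from fractions import Fraction

h, t = 1, 0  # one flip, one head

frequentist = Fraction(h, h + t)                  # 1 -- certain of heads!
naive       = Fraction(1, 2)                      # ignores the data entirely
jeffreys    = (h + Fraction(1, 2)) / (h + t + 1)  # 3/4
laplace     = Fraction(h + 1, h + t + 2)          # 2/3

print(frequentist, naive, jeffreys, laplace)  # 1 1/2 3/4 2/3
```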
Laws that are correct are correct in the extreme cases. (If we have distributed 6-sided dice) let’s try this. Everybody roll your die. If it comes up odd you win, and if even you lose. Okay, somebody who won, raise your hand. Each one of you, if purely frequentist, estimates a 100% chance of winning this game (if you stick only to data from your die). Now please put your hands down. Everybody who did not win: how do you feel about the estimate of 100% chance of winning?
What is the frequentist estimate designed to do?
"Bayesian Data Analysis" 3rd Edition, Gelman, Carlin, Stern,
Dunson, Vehtari, Rubin p. 92 states that frequentist estimates are
designed to be consistent (as the sample size increases they
converge to the unknown value), efficient (they tend to minimize
loss or expected square-error), or even have asymptotic
unbiasedness (the difference in the estimate from the true value
converges to zero as the experiment size increases, even when re-scaled
by the shrinking standard error of the estimate).
If we think about it: frequentism is interpreting probabilities as limits
of rates of repeated experiments. In this form bias is an especially
bad form of error as it doesn’t average out.
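The “bias does not average out” concern can be made concrete: averaging the estimator over the sampling distribution of h recovers p exactly for h/n, but not for the smoothed estimate. A Python sketch (n = 10 and p = 3/10 are arbitrary illustrative choices):

```python
from fractions import Fraction
from math import comb

def expected_estimate(n, p, estimator):
    # E[estimator(h)] with h ~ Binomial(n, p), computed exactly
    return sum(comb(n, h) * p**h * (1 - p)**(n - h) * estimator(h)
               for h in range(n + 1))

n, p = 10, Fraction(3, 10)

e_freq = expected_estimate(n, p, lambda h: Fraction(h, n))          # h/n
e_laplace = expected_estimate(n, p, lambda h: Fraction(h + 1, n + 2))

print(e_freq == p)     # True: h/n is unbiased at every p
print(e_laplace == p)  # False: smoothing trades bias for lower variance
```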
Why not minimize L1 error? Because this doesn’t always turn out to be unbiased (or isn’t always a regression). Bayesians can allow bias. The saving idea: don’t average estimators, but aggregate data and form a new estimate.
Frequentist concerns: bias and efficiency
From: “The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt.
An estimator θ̂ for which E[θ̂] = θ is said to be unbiased.
Efficiency: a term applied in the context of comparing different methods of estimating the same parameter; the estimate with the lowest variance being regarded as the most efficient.
There is more than one unbiased estimate. For example, a grand average (unconditioned by features) is an unbiased estimate.
A good motivation of the
Adapted from “Schaum’s Outlines Statistics” 4th Edition, Spiegel, Stephens,
SAMPLING DISTRIBUTIONS OF MEANS
Suppose that all possible samples of size N are drawn without replacement from a finite population of size Np > N. If we denote the mean and standard deviation of the sampling distribution of means by E[μ̂] and E[σ̂]