Frequentist inference only seems easy By John Mount

This is part of the Alpine ML Talk Series:
The talk is called “Frequentist inference only seems easy” and is about the theory of simple statistical inference (based on material from this article). The talk includes some simple dice games (I bring dice!) that really break the rote methods commonly taught as statistics. This is actually a good thing, as it gives you time and permission to work out how common statistical methods are properly derived from basic principles. This takes a little math (which I develop in the talk), but it changes some statistics from “do this” to “here is why you calculate like this.” It should appeal to people interested in the statistical and machine learning parts of data science.

  1. 1. Frequentist estimation only seems easy. John Mount, Win-Vector LLC.
Outline. First example problem: estimating the success rate of coin flips. Second example problem: estimating the success rate of a dice game. Interspersed in both: an entomologist’s view of lots of heavy calculation. Image from “HOW TO PIN AND LABEL ADULT INSECTS”, Bambara, Blinn, 4H/insect_pinning4a.html
This talk is going to alternate between simple probability games (like rolling dice) and the detailed calculations needed to bring the reasoning forward. If you come away with two points from this talk, remember: classic frequentist statistics is not as cut and dried as teachers claim (so it is okay to ask questions), and Bayesian statistics is not nearly as complicated as people make it appear.
The point of this talk. Statistics is a polished field where many of the foundations are no longer discussed. A lot of the “math anxiety” felt in learning statistics is from uncertainty about these foundations, and how they actually lead to common practices. We are going to discuss common simple statistical goals (correct models, unbiasedness, low error) and how they lead to common simple statistical procedures. The surprises (at least for me) are: there is more than one way to do things, and the calculations needed to justify how even simple procedures are derived from the goals are in fact pretty involved.
A lot of the pain of learning is being told there is only “one way” (when there is more than one) and that a hard step (linking goals to procedures) is easy (when in fact it is hard). Statistics would be easier to teach if those two things were true, but they are not. However, not addressing these issues makes learning statistics harder than it has to be. We are going to spend some time on what are appropriate statistical goals, and how they lead to common statistical procedures (instead of claiming everything is obvious). You won’t be expected to invent the math, but you need to accept that it is in fact hard to justify common statistical procedures without somebody having already done the math. And I’ll be honest: I am a math-for-math’s-sake guy.
  2. 2. What you will get from this presentation. Simple puzzles that present problems for the common rules of estimating rates. Good for countering somebody who says “everything is easy and you just don’t get it.” Examples that expose strong consequences of the seemingly subtle differences in common statistical estimation methods. Makes understanding seemingly esoteric distinctions like Bayesianism and frequentism much easier. A taste of some of the really neat math used to establish common statistics. A revival of Wald game-theoretic style inference (as described in Savage “The Foundations of Statistics”).
You will get to roll the die, and we won’t make you do the heavy math. Aside: we have been telling people that one of the things that makes data science easy is that large data sets allow you to avoid some of the hard math in small sample size problems. Here we work through some of the math. In practice you do get small sample size issues even in large data sets, due to heavy-tail like phenomena and when you introduce conditioning and segmentation (themselves typical modeling steps).
First example: coin flip game.
Why do we even care? The coin problem is a stand-in for something that is probably important to us, such as estimating the probability of a sale given features and past experience: P[ sale | features, evidence ]. Being able to efficiently form good estimates that combine domain knowledge, current features and past data is the ultimate goal of analytics/data-science.
  3. 3. The coin problem. You are watching flips of a coin and want to estimate the probability p that the coin comes up heads. For example: "T" "T" "H" "T" "H" "T" "H" "H" "T" "T". Easy to apply! Sufficient statistic: 4 heads, 6 tails. Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4. Done. Thanks for your time.
R code: set.seed(2014); sample = rbinom(10,1,0.5); print(ifelse(sample>0.5,'H','T'))
Wait, how did we know to do that? Why is it obvious h/(h+t) is the best estimate of the unknown true value of p?
Fundamental problem: a mid-range probability prediction (say a number in the range 1/6 to 5/6) is not falsifiable by a single experiment. So: how do we know such statements actually have empirical content? The usual answers are performance on long sequences (frequentist), appeals to axioms of probability (essentially additivity of disjoint events), and subjective interpretations. Each view has some assumptions and takes some work.
The standard easy estimate comes from frequentism. The standard answer (this example from wiki/Checking_whether_a_coin_is_fair ): “Estimator of true probability. The best estimator for the actual value is the estimator p = h/(h+t). This estimator has a margin of error (E) at a particular confidence level.” Answer is correct and simple, but not good (as it lacks context, assumptions, goals, motivation and explanation). Stumper: without an appeal to authority, how do we know to use the estimate heads/(heads+tails)? What problem is such an estimate solving (what criterion is it optimizing)?
Further excerpts from the same Wikipedia page (a worked example with 7 heads and 3 tails): “[...] is small when compared with the alternative hypothesis (a biased coin). However, it is not small enough to cause us to believe that the coin has a significant bias. Notice that this probability is slightly higher than our presupposition of the probability that the coin was fair corresponding to the uniform prior distribution, which was 10%. Using a prior distribution that reflects our prior knowledge of what a coin is and how it acts, the posterior distribution would not favor the hypothesis of bias. However the number of trials in this example (10 tosses) is very small, and with more trials the choice of prior distribution would be somewhat less relevant.) Note that, with the uniform prior, the posterior probability distribution f(r | H = 7, T = 3) achieves its peak at r = h / (h + t) = 0.7; this value is called the maximum a posteriori (MAP) estimate of r. Also with the uniform prior, the expected value of r under the posterior distribution is $(h + 1)/(h + t + 2) = 2/3$.” Figure caption: plot of the probability density $f(x \mid H = 7, T = 3) = 1320\, x^7 (1 - x)^3$ with x ranging from 0 to 1. “Using this approach, to decide the number of times the coin should be tossed, two parameters are required: 1. The confidence level which is denoted by confidence interval (Z). 2. The maximum (acceptable) error (E). The confidence level is denoted by Z and is given by the Z-value of a standard normal distribution.”
Notation is a bit different: here tau is the unknown true value and p is the estimate. Throughout this talk by “coin” we mean an abstract device that always returns one of two states. Gelman and Nolan have an interesting article, “You Can Load a Die, But You Can’t Bias a Coin” research/published/diceRev2.pdf, about how hard it would be to bias an actual coin that you allow somebody else to flip (and how useless articles testing the fairness of the new Euro were).
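The numbers quoted in the Wikipedia excerpt for 7 heads and 3 tails can be checked numerically. What follows is a minimal sketch rather than anything from the original slides: the variable names are made up for illustration, and the only fact assumed is that a uniform prior combined with h heads and t tails gives a Beta(h+1, t+1) posterior.

```r
# Sketch: posterior for the heads probability under a uniform prior.
# After `heads` heads and `tails` tails the posterior is
# Beta(heads + 1, tails + 1). Variable names are illustrative.
heads <- 7
tails <- 3

# Normalizing constant (heads + tails + 1)! / (heads! * tails!)
norm_const <- factorial(heads + tails + 1) /
  (factorial(heads) * factorial(tails))

# Compare the closed form norm_const * x^7 * (1 - x)^3 with dbeta().
x <- seq(0.01, 0.99, by = 0.01)
closed_form <- norm_const * x^heads * (1 - x)^tails
builtin <- dbeta(x, heads + 1, tails + 1)
max_abs_diff <- max(abs(closed_form - builtin))

# MAP estimate: numerically locate the peak of the posterior density.
map_est <- optimize(function(r) dbeta(r, heads + 1, tails + 1),
                    interval = c(0, 1), maximum = TRUE)$maximum

# Posterior mean of a Beta(a, b) distribution is a / (a + b).
post_mean <- (heads + 1) / (heads + tails + 2)

print(c(norm_const = norm_const, map = map_est, mean = post_mean))
```

The peak sits at h/(h+t) while the mean is pulled toward 0.5, which is one concrete way the MAP and posterior-mean summaries of the same posterior disagree.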
  4. 4. Also, there are other common estimates. Examples: A priori belief: p ~ 0.5 regardless of evidence. Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails+1) = 0.4090909. Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) = 0.4166667. Game theory minimax estimates (more on this later in this talk). The classic frequentist estimate is not the only acceptable estimate.
Each of these has its merits. An a priori belief has the least sampling noise (as it ignores the data). Bayesian with Jeffreys prior very roughly tries to maximize the amount of information captured in the first observation. Laplace smoothing minimizes expected square error under a uniform prior.
Each different estimate has its own characteristic justification. From “The Cartoon Guide to Statistics”, Gonick and Smith.
If all of the estimates were “fully compatible” with each other then they would all be identical, which they clearly are not. Notice we are discussing differences in estimates here, not differences in significances or hypothesis tests. Also, Bayesian priors are not always subjective beliefs (Wald in particular used an operational definition).
The standard story. There are 1 to 2 ways to do statistics: frequentism and maybe Bayesianism. In frequentist estimation the unknown quantity to be estimated is fixed at a single value and the experiment is considered a repeatable event (with different possible measurements on each repetition). All probabilities are over possible repetitions of the experiment, with observations changing. In Bayesian estimation the unknown quantity to be estimated is assumed to have a non-trivial distribution and the experimental results are considered fixed. All probabilities are over possible values of the quantity to be estimated. Priors talk about the assumed distribution before measurement; posteriors talk about the distribution conditioned on the measurements.
There are other differences, such as a preference for point-wise estimates versus full descriptions of distributions. And these are not the only possible models.
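The estimates listed on this slide for the 4 heads, 6 tails sample can be reproduced in a few lines. This is an illustrative sketch, not the author's code: the names are hypothetical, and the two `integrate()` calls rely on the standard facts that the Jeffreys prior for a coin is Beta(0.5, 0.5) and the uniform prior is Beta(1, 1), so the corresponding estimates are posterior means.

```r
# Sketch: the competing point estimates for 4 heads and 6 tails.
heads <- 4
tails <- 6
n <- heads + tails

estimates <- c(
  a_priori    = 0.5,                       # ignores the data entirely
  frequentist = heads / n,                 # observed rate
  jeffreys    = (heads + 0.5) / (n + 1),   # posterior mean, Beta(0.5, 0.5) prior
  laplace     = (heads + 1) / (n + 2)      # posterior mean, uniform prior
)
print(estimates)

# Cross-check the two Bayesian estimates by integrating p against the
# posterior density directly (numerical integration, so only approximate).
jeffreys_mean <- integrate(
  function(p) p * dbeta(p, heads + 0.5, tails + 0.5), 0, 1)$value
laplace_mean <- integrate(
  function(p) p * dbeta(p, heads + 1, tails + 1), 0, 1)$value
print(c(jeffreys_mean = jeffreys_mean, laplace_mean = laplace_mean))
```

All four numbers differ on the same ten flips, which is the point of the slide: each estimate answers a slightly different question.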
  5. 5. Our coin example again. I flip a coin a single time and it comes up heads. What is my best estimate of the probability the coin comes up heads in repeated flips? “Classic”/naive probability: 0.5 (independent of observations/data). Frequentist: 1.0. Bayesian (Jeffreys prior): 0.75.
Laws that are correct are correct in the extreme cases. (If we have distributed 6-sided dice:) Let’s try this. Everybody roll your die. If it comes up odd you win, and if even you lose. Okay, everybody who won, raise your hand. Each one of you, if purely frequentist, estimates a 100% chance of winning this game (if you stick only to data from your die). Now please put your hands down. Everybody who did not win, how do you feel about the estimate of 100% chance of winning?
What is the frequentist estimate optimizing? “Bayesian Data Analysis” 3rd Edition, Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, p. 92 states that frequentist estimates are designed to be consistent (as the sample size increases they converge to the unknown value), efficient (they tend to minimize loss or expected square-error), or even have asymptotic unbiasedness (the difference in the estimate from the true value converges to zero as the experiment size increases, even when re-scaled by the shrinking standard error of the estimate). If we think about it: frequentism is interpreting probabilities as limits of rates of repeated experiments. In this form bias is an especially bad form of error, as it doesn’t average out.
Why not minimize L1 error? Because this doesn’t always turn out to be unbiased (or isn’t always a regression). Bayesians can allow bias. The saving idea is: don’t average estimators, but aggregate data and form a new estimate.
Frequentist concerns: bias and efficiency (variance). From “The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt. Bias: An estimator for which $E[\hat{\theta}] = \theta$ is said to be unbiased. Efficiency: A term applied in the context of comparing different methods of estimating the same parameter; the estimate with the lowest variance being regarded as the most efficient.
There is more than one unbiased estimate. For example a grand average (unconditioned by features) is an unbiased estimate.
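To make the bias definition concrete, the expectation $E[\hat{\theta}]$ of a coin estimator can be computed exactly by summing over every possible head count. The sketch below is an illustration under stated assumptions, not the talk's own calculation: `expected_estimate`, `frequentist`, and `jeffreys` are hypothetical helper names, and the sample size of 10 flips is chosen arbitrarily.

```r
# Sketch: exact expected value of an estimator over all outcomes of
# n independent flips of a coin with heads probability p.
expected_estimate <- function(estimator, n, p) {
  h <- 0:n
  sum(dbinom(h, n, p) * estimator(h, n))
}

frequentist <- function(h, n) h / n
jeffreys <- function(h, n) (h + 0.5) / (n + 1)

# The single-flip example from the slide: one flip, one head.
single_flip <- c(frequentist = frequentist(1, 1), jeffreys = jeffreys(1, 1))
print(single_flip)

# Bias (expected estimate minus true p) at a few true values of p.
p_grid <- c(0.1, 0.5, 0.9)
bias_freq <- sapply(p_grid, function(p) expected_estimate(frequentist, 10, p) - p)
bias_jeff <- sapply(p_grid, function(p) expected_estimate(jeffreys, 10, p) - p)
print(rbind(p = p_grid, frequentist = bias_freq, jeffreys = bias_jeff))
```

The frequentist rate has zero bias at every p, while the Jeffreys estimate is pulled toward 0.5 by (0.5 - p)/(n + 1); that shrinkage is the price it pays for avoiding the 100% answer after a single win.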
  6. 6. A good motivation of the frequentist estimate. Adapted from “Schaum’s Outlines Statistics” 4th Edition, Spiegel, Stephens, pp. 204-205. SAMPLING DISTRIBUTIONS OF MEANS. Suppose that all possible samples of size N are drawn without replacement from a finite population of size Np > N. If we denote the mean and standard deviation of the sampling distribution of means by $E[\hat{\mu}]$ and E[ˆ