This is part of the Alpine ML Talk Series:
The talk is called "Frequentist inference only seems easy" and covers the theory of simple statistical inference (based on material from this article: http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/ ). The talk includes some simple dice games (I bring dice!) that really break the rote methods commonly taught as statistics. This is actually a good thing, as it gives you time and permission to work out how common statistical methods are properly derived from basic principles. This takes a little math (which I develop in the talk), but it changes some statistics from "do this" to "here is why you calculate like this." It should appeal to people interested in the statistical and machine learning parts of data science.
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science (jemille6)
D. Mayo discusses various disputes, notably the replication crisis in science, in the context of her just-released book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph... (jemille6)
The constructive role of the replication crisis teaches a lot about (1) non-fallacious uses of statistical tests, (2) the rationale for the role of probability in tests, and (3) how to reformulate tests.
Replication Crises and the Statistics Wars: Hidden Controversies (jemille6)
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
D. G. Mayo: Your data-driven claims must still be probed severely (jemille6)
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics.
Byrd statistical considerations of the histomorphometric test protocol (1) (jemille6)
"Statistical considerations of the histomorphometric test protocol"
John E. Byrd, Ph.D. D-ABFA
Maria-Teresa Tersigni-Tarrant, Ph.D.
Central Identification Laboratory
JPAC
beyond objectivity and subjectivity; a discussion paper (Christian Robert)
This document discusses issues with the foundations of statistical analysis and modeling. It argues that statistical analysis often makes the wrong assumption that data is randomly generated by a probabilistic model. Additionally, there is too much focus on technical statistical details rather than providing approximate solutions that are useful to non-statistician users. The document advocates for a more subjective Bayesian approach that embraces uncertainty and variation rather than relying on tests and rigid models. It also calls for statistical analyses to be more transparent by explicitly stating all assumptions and modeling choices.
Spanos lecture 7: An Introduction to Bayesian Inference (jemille6)
This document provides an introduction to Bayesian inference through lecture notes on probability and statistics. It discusses three interpretations of probability: classical, degrees of belief, and frequency. The classical interpretation relies on an explicit chance mechanism with equally likely outcomes, which is too restrictive for empirical modeling. The degrees of belief interpretation considers probability as subjective beliefs, while the frequency interpretation views probability as the limit of relative frequencies in repeated experiments, as justified by the Strong Law of Large Numbers.
Mayo: 2nd half "Frequentist Statistics as a Theory of Inductive Inference" (S... (jemille6)
This document summarizes issues related to data-dependent selections and hypothesis testing. It discusses how preliminary inspection of data can influence test statistics and null hypotheses, potentially altering a test's ability to reliably detect discrepancies from the null. Two examples are provided:
1) "Hunting" through multiple independent tests and only reporting the most statistically significant result can incorrectly estimate the actual error rate as being much higher than the nominal rate of 5%.
2) Searching a DNA database and declaring a match with the first individual found is different: each non-match strengthens the evidence for the inferred match, so no adjustment is needed as in the statistical "hunting" case.
Selection of cut-offs or models is also discussed.
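To make the "hunting" point concrete, here is a minimal sketch (in Python; the 5% nominal level comes from the summary above, the rest is illustrative) of how fast the chance of at least one false positive grows across independent tests:

# Family-wise error rate when "hunting" through k independent tests,
# each run at a nominal 5% level: FWER = 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: chance of at least one false positive = {fwer:.2f}")
# 1 test: 0.05; 5 tests: 0.23; 10 tests: 0.40; 20 tests: 0.64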
This document discusses four waves in the philosophy of statistics from the 1930s to the present. The first wave centered around debates between Fisher, Neyman, Pearson, and others regarding hypothesis testing and the interpretation of p-values. The second wave from the 1960s-1980s involved criticisms of Neyman-Pearson methods regarding their applicability before and after obtaining data. The third wave from 1980-2005 explored likelihoodism and debates over the likelihood principle. The fourth and ongoing wave since 2005 continues philosophical debates in statistics.
The document discusses probability and statistical inference. It provides definitions and examples of key concepts in probability such as experiments, sample spaces, events, and rules for calculating probabilities of events. Examples include calculating the probability of getting a 7 when rolling two dice and the probability of testing positive for a disease given the accuracy of the test. The document also provides examples of applying probability concepts to problems involving cards and birthdays.
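Both worked examples mentioned above (a 7 from two dice, and a positive disease test) can be checked in a few lines of Python; the test's sensitivity, false-positive rate, and prevalence below are illustrative assumptions, since the summary does not give the actual figures:

from fractions import Fraction
from itertools import product

# P(sum of two dice is 7): count favorable outcomes in the 36-point sample space.
outcomes = list(product(range(1, 7), repeat=2))
p_seven = Fraction(sum(1 for a, b in outcomes if a + b == 7), len(outcomes))
print(p_seven)  # 1/6

# P(disease | positive test) via Bayes' theorem, with assumed figures:
sens, fpr, prev = 0.99, 0.02, 0.01  # sensitivity, false-positive rate, prevalence
p_pos = sens * prev + fpr * (1 - prev)  # total probability of a positive test
print(round(sens * prev / p_pos, 3))    # ≈ 0.333, despite the "99% accurate" test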
Slides from Deborah G. Mayo's talk at the Minnesota Center for Philosophy of Science, University of Minnesota, on the ASA 2016 statement on P-values and Error Statistics.
The document discusses how statistical methodology and modeling can influence theories and findings. It notes that all statistical models are imperfect, statistical significance does not equal substantive significance, correlation does not imply causation, and data can be manipulated. Specifically, it warns about issues like cherry picking data, multiple testing inflating false positives, and nominal significance levels differing from actual error rates when accounting for selection effects. The document advocates evaluating statistical models based on how well they capture phenomena of interest and checking their adequacy despite violations of assumptions.
This document discusses random function models used in geostatistical estimation. It begins by explaining that estimation requires an underlying model to make inferences about unknown values that were not sampled. Geostatistical methods clearly state the probabilistic random function model on which they are based. The document then provides examples to illustrate deterministic and probabilistic models. Deterministic models can be used if the generating process is well understood, but most earth science data require probabilistic random function models due to uncertainty between sample locations. These models conceptualize the data as arising from a random process, even though the true processes are not truly random. The key aspects of the model that need to be specified are the possible outcomes and their probabilities.
This document discusses various forms of inexact knowledge and reasoning, including uncertainty, incomplete knowledge, defaults and beliefs, contradictory knowledge, and vague knowledge. It provides examples of how probabilistic reasoning, fuzzy logic, truth maintenance systems, certainty factors, and other approaches can be used to represent and reason with inexact knowledge. Key concepts covered include uncertainty, incomplete knowledge, defaults, beliefs, contradictory knowledge, and vague knowledge.
This document contains 54 multiple choice questions about probability concepts from the textbook "Quantitative Analysis for Management, 11e". The questions cover topics such as fundamental probability concepts, mutually exclusive and collectively exhaustive events, statistically independent events, probability distributions including binomial and normal distributions, and Bayes' theorem. For each question, the answer and difficulty level is provided along with the topic area.
Statistics is used to interpret data and draw conclusions about populations based on sample data. Hypothesis testing involves evaluating two statements (the null and alternative hypotheses) about a population using sample data. A hypothesis test determines which statement is best supported.
The key steps in hypothesis testing are to formulate the hypotheses, select an appropriate statistical test, choose a significance level, collect and analyze sample data to calculate a test statistic, determine the probability or critical value associated with the test statistic, and make a decision to reject or fail to reject the null hypothesis based on comparing the probability or test statistic to the significance level and critical value.
An example tests whether the proportion of internet users who shop online is greater than 40% using ...
Hypothesis testing involves developing a null hypothesis (H0) and an alternative hypothesis (Ha) to test a given situation. H0 states there is no difference, while Ha states there is a difference. Tests can be one-tailed or two-tailed. A two-tailed test rejects H0 if the sample mean is significantly different in either direction, while a one-tailed test only rejects if the difference is in the direction specified by Ha. When conducting a test, there is a risk of making a Type I error by rejecting a true H0, or a Type II error by failing to reject a false H0. The significance level determines the probability of a Type I error.
Hypothesis testing involves stating a null hypothesis (H0) and an alternative hypothesis (H1). H0 assumes there is no effect or relationship in the population. H1 states there is an effect. A study is conducted and statistics are used to determine if the data supports rejecting H0 in favor of H1. The p-value indicates the probability of obtaining results as extreme as the observed data, or more extreme, if H0 is true. If p ≤ the predetermined significance level (α = 0.05), H0 is rejected in favor of H1. Otherwise, H0 is retained but not proven true. A Type I error occurs when a true H0 is incorrectly rejected; a Type II error occurs when a false H0 is incorrectly retained.
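As a minimal sketch of the procedure these summaries describe (the sample values are hypothetical, and scipy is assumed to be available):

from scipy import stats

# One-sample t-test of H0: mu = 50 against H1: mu != 50 on hypothetical data.
sample = [52.1, 49.8, 53.4, 51.0, 48.9, 54.2, 50.7, 52.8]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")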
This document discusses confidence intervals for estimating population parameters. It provides examples of constructing point and interval estimates for the population mean and proportion from sample data. Confidence intervals allow us to estimate a range of plausible values for the true population parameter based on the sample results and desired confidence level, rather than just a single point value. The width of the confidence interval depends on the sample size and confidence level, with larger samples and lower confidence levels producing narrower intervals.
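A sketch of the interval calculation described here, from summary statistics (all numbers hypothetical):

import math
from scipy import stats

# 95% confidence interval for a mean from summary statistics.
n, xbar, s = 25, 103.2, 8.5
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
half = t_crit * s / math.sqrt(n)
print(f"95% CI: ({xbar - half:.2f}, {xbar + half:.2f})")

# A larger sample narrows the interval, as the summary notes:
half_100 = stats.t.ppf(0.975, df=99) * s / math.sqrt(100)
print(f"half-width at n=100: {half_100:.2f} vs {half:.2f} at n=25")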
The document discusses hypothesis testing and provides examples to illustrate the process. It explains how to state the research question and hypotheses, set the decision rule, calculate test statistics, decide if results are significant, and interpret the findings. An example tests if narcissistic individuals look in the mirror more often than others and finds they do based on a test statistic exceeding the critical value. A second example finds no significant difference in recovery time for patients with or without social support after surgery.
This document provides instructions for conducting two statistical tests: Spearman's Rank Correlation Coefficient and Chi-Square. Spearman's Rank is used to analyze the relationship between two variables like distance and environmental quality. It involves ranking values, calculating differences between ranks, and using a formula to determine if the relationship is statistically significant. Chi-Square analyzes relationships between categorical variables like opinions and demographics. It involves creating a results table, calculating expected values, applying a formula, and determining statistical significance based on degrees of freedom. Both tests are used to evaluate a null hypothesis of no relationship between variables.
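Both tests have standard scipy implementations; a minimal sketch with invented data:

from scipy import stats

# Spearman's rank correlation: distance from a site vs. an environmental
# quality score (hypothetical paired observations).
distance = [1, 2, 3, 4, 5, 6, 7, 8]
quality = [2, 1, 4, 3, 6, 5, 8, 7]
rho, p_rho = stats.spearmanr(distance, quality)
print(f"rho = {rho:.2f}, p = {p_rho:.3f}")

# Chi-square test of independence: opinion (rows) by age group (columns).
observed = [[30, 10],
            [20, 40]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4f}")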
MSL 5080, Methods of Analysis for Business Operations 1.docx (AASTHA76)
MSL 5080, Methods of Analysis for Business Operations 1
Course Learning Outcomes for Unit II
Upon completion of this unit, students should be able to:
2. Distinguish between the approaches to determining probability.
Reading Assignment
Chapter 2: Probability Concepts and Applications, pp. 23–32
Unit Lesson
As you know, much in the world happens in amounts we can count—some in discrete numbers (1 item, 2
items, 3 items, never with a fraction of an item) or continuous ones (3.75 hours, 2.433333 hours) that could be
any fraction within a given range. Because of this, one can either calculate or estimate probability, which will
be the focus of this unit.
Probability
Who wants to calculate probability? Businesses (including farmers and ranchers raising crops and livestock),
governments, and anyone wanting to quantify risks in life will calculate probability. This includes people
involved in gaming as well. As you read, the textbook illustrates probability with coin tosses where the
outcomes are just two possible ones—heads or tails (Render, Stair, Hanna, & Hale, 2015). You may know
that gamblers have more complex probability problems to estimate a solution for—as in Texas Hold ‘Em,
where a cardholder may be calculating whether the remaining players are holding higher hands than his or
her own. There are answers available to the cardholder’s dilemma as well.
Probability is “a numerical statement about the likelihood that an event will occur” (Render et al., 2015, p. 24).
Mathematics can model this for us. Because some mathematical terms are equal to others, you can state the
formulas for certain probabilities as you see in Chapter 2 of the textbook. In the physical world, the probability
of anything is either 0 (cannot happen), 1 (a 100% chance of happening), or some fraction between 0 and 1 (a little, some, an even, or a probable chance of happening). Since something must happen on every trial, the probabilities of all possible outcomes sum to 1. A tossed coin has a 50% (.5) chance of coming up heads and the same 50% (.5) chance of coming up tails, and something will come up when the coin is tossed: .5 + .5 = 1.
So for probability of the event = P(event):
0 ≤ P(event) ≤ 1
The probability to be calculated is in that range somewhere! Now, how do you find it? Here are two types of approaches that fit what happens: the objective approach and the subjective approach.
Types of Probability
Objective approach: The objective approach (when you can use numbers to calculate probability
directly) uses two common methods (Render et al., 2015):
1. The relative frequency method is used when you know how often things happen (as in the coin
example above, if you know how many times it was tossed), and
2. The classical or logical method is used when you know how often things should happen (e.g., the number of ways a coin can land, heads or tails, without knowing the number of actual tosses).
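The contrast between the two methods is easy to show in code; a sketch (not from the course materials):

import random

# Classical/logical method: two equally likely ways a fair coin can land,
# so P(heads) = 1/2 by counting outcomes, with no tossing required.
p_classical = 1 / 2

# Relative frequency method: estimate P(heads) from how often heads
# actually came up in a recorded series of tosses.
random.seed(0)
tosses = [random.choice("HT") for _ in range(10_000)]
p_frequency = tosses.count("H") / len(tosses)

print(p_classical, p_frequency)  # the frequency estimate approaches 0.5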
This document introduces probability and discusses different approaches to defining it. It notes that probability is used to describe variability and uncertainty when outcomes are not certain. Three common definitions of probability are discussed - classical, relative frequency, and subjective - along with their limitations. The document advocates treating probability as a mathematical system defined by axioms rather than worrying about numerical values until a specific application. It then outlines how to construct probability models using sample spaces and assigning probabilities to events based on their composition of simple events.
Module-2_Notes-with-Example for data science (pujashri1975)
The document discusses several key concepts in probability and statistics:
- Conditional probability is the probability of one event occurring given that another event has already occurred.
- The binomial distribution models the probability of a given number of successes in a fixed number of binary trials. It applies when there are a fixed number of trials, two possible outcomes, and the same probability of success on each trial.
- The normal distribution is a continuous probability distribution that is symmetric and bell-shaped. It is characterized by its mean and standard deviation. Many real-world variables approximate a normal distribution.
- Other concepts discussed include range, interquartile range, variance, and standard deviation. The interquartile range describes the spread of a dataset's middle 50%.
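A minimal sketch of these distributions and spread measures (parameters and data invented for illustration):

import numpy as np
from scipy import stats

# Binomial: P(exactly 7 successes in 10 trials with success probability 0.5).
print(stats.binom.pmf(7, n=10, p=0.5))          # ≈ 0.117

# Normal: probability of falling within one standard deviation of the mean.
print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # ≈ 0.683

# Spread measures for a small hypothetical dataset.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(data.max() - data.min())                            # range
print(np.percentile(data, 75) - np.percentile(data, 25))  # interquartile range
print(data.var(ddof=1), data.std(ddof=1))                 # sample variance, std dev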
The document provides an overview of key statistical concepts including variance, standard deviation, the normal distribution, frequency distributions, data matrices, properties of good graphs, populations and samples, parameters and statistics, hypothesis testing, and point and interval estimation. It defines these terms and explains concepts like the null hypothesis, alternative hypothesis, critical regions, test statistics, and making decisions based on probability thresholds.
The document discusses key statistical concepts including variance, standard deviation, the normal distribution, frequency distributions, data matrices, properties of good graphs, populations and parameters, hypothesis testing, and point and interval estimation. It provides definitions and examples of these terms and how they relate to drawing statistical inferences from data.
1. The document discusses how decision theory and test-taking behavior research can provide insights into psychometrics and test construction.
2. It analyzes factors like the optimal number of response options, the benefits and drawbacks of guessing, and how the framing of scoring rules like penalties vs bonuses can influence test-taker behavior.
3. The author argues that standard psychometric assumptions do not always reflect how test-takers actually analyze situations and make decisions, and that accounting for behavioral factors could improve test quality.
This slide deck explains variable types, the probability theory behind the algorithms and its uses, including distributions. Theorems such as Bayes' theorem are also explained.
1. The document discusses empiricism in philosophy and science, from early thinkers like Newton to modern ones like Taleb. It emphasizes the importance of experimental testing of hypotheses.
2. The document then gives examples of common mistakes made in applying statistics and empiricism to financial modeling and stock selection, such as ignoring data that doesn't fit hypotheses.
3. It analyzes some sample stock portfolio and factor modeling data, finding the models' effectiveness varied over time, highlighting the need for ongoing empirical testing of hypotheses.
This document provides a summary of key concepts in advanced business mathematics and statistics. It defines measures of central tendency including mean, mode, and median. It also discusses measures of dispersion like range and standard deviation. Additionally, it covers topics like regression, hypothesis testing, probability, and different types of statistical analysis.
In the last column we discussed the use of pooling to get a be... (MalikPinckney86)
Statistics in the Laboratory: Standard Deviation of the Mean

In the last column we discussed the use of pooling to get a better estimate of the standard deviation of the measurement method, essentially the standard deviation of the raw data. But as the last column implied, most of the time individual measurements are averaged, and decisions must take into account another standard deviation, the standard deviation of the mean, sometimes called the "standard error" of the mean. It's helpful to explore this statistic in more detail: first, to understand why statisticians often recommend a "sledgehammer" approach to data collection methods; and, second, to see that there might be a better alternative to this crude tactic. We'll also see how to answer the question, "How big should my sample size be?"

For the next few columns, we need to discuss in more detail the ways statisticians do their theoretical work and the ways we use their results. I often say that theoretical statisticians live on another planet (they don't, of course, but let's say Saturn), while those of us who apply their results live on Earth. Why do I say that? Because a lot of theoretical statistics makes the unrealistic assumption that there is an infinite amount of data available to us (statisticians call it an infinite population of data). When we have to pay for each measurement, that's a laughable assumption. We're often delighted if we have a random sample of that data, perhaps as many as three replicate measurements from which we can calculate a mean.

That last sentence contains a telling phrase: "a random sample of that data." Statisticians imagine that the infinite population of data contains all possible values we might get when we make measurements. Statisticians view our results as a random draw from that infinite population of possible results that have been sitting there waiting for us. If we were to make another set of measurements on the same sample, we'd get a different set of results. That doesn't surprise the statisticians (and it shouldn't surprise us if we adopt their view)—it's just another random draw of all the results that are just waiting to appear.

On Saturn they talk about a mean, but they call it a "true" mean. They don't intend to imply that they have a pipeline to the National Institute of Standards and Technology and thus know the absolutely correct value for what the mean represents. When they call it a "true mean," they're just saying that it's based on the infinite amount of data in the population, that's all.

Statisticians generally use Greek letters for true values—μ for a true mean, σ for a true standard deviation, δ for a true difference, etc. The technical name for these descriptors (μ, σ, δ) is parameters. You've probably been casual about your use of this word, employing it to refer to, say, the pH you're varying in your experiments, or the yield you get ...
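The statistic the column describes is simple to compute; a minimal sketch with three hypothetical replicate measurements:

import math
import statistics

# Three replicates, as in the "random draw from an infinite population" picture.
replicates = [10.12, 10.08, 10.15]
s = statistics.stdev(replicates)       # sample standard deviation of the raw data
sem = s / math.sqrt(len(replicates))   # standard deviation (error) of the mean
print(f"mean = {statistics.mean(replicates):.3f}, s = {s:.4f}, SEM = {sem:.4f}")
# Averaging n measurements shrinks the scatter of the mean by a factor of sqrt(n).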
- The document summarizes key concepts from sections 1.1 to 1.6 of the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop.
- It introduces polynomial curve fitting, Bayesian curve fitting, decision theory, and information theory concepts such as entropy, Kullback-Leibler divergence, and their applications in machine learning.
- Key algorithms covered include linear and polynomial regression, maximum likelihood estimation, and using entropy and KL divergence to model probability distributions.
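As a small illustration of the polynomial curve fitting that opens the book (a sketch with numpy; the noisy-sinusoid toy data mirrors Bishop's running example, and least squares is the maximum-likelihood fit under Gaussian noise):

import numpy as np

# Noisy samples from sin(2*pi*x), fit with a degree-3 polynomial by least squares.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

coeffs = np.polyfit(x, t, deg=3)
x_new = np.linspace(0, 1, 5)
print(np.polyval(coeffs, x_new))   # predictions at new inputs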
The document discusses a one-sample t-test used to compare sample data to a standard value. It provides an example comparing intelligence scores of university students to the average score of 100. The sample of 6 students had a mean of 120. A one-tailed t-test run in SPSS showed the mean score was significantly higher than 100, with t(5) = 3.15, p = .02. This allows the inference that the population mean intelligence at the university is greater than the standard score of 100.
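The reported result can be sanity-checked from the quoted statistic alone; a sketch recovering the one-tailed p-value implied by t(5) = 3.15 (small rounding differences from the quoted p = .02 are possible):

from scipy import stats

t_stat, df = 3.15, 5
p_one_tailed = stats.t.sf(t_stat, df)   # P(T >= 3.15) under H0
print(round(p_one_tailed, 3))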
The Importance of Probability in Data Science.docx (nearlearn)
Probability is an essential concept in data science, as it provides the foundation for making informed decisions based on data. Probability theory helps us understand the uncertainty associated with data, and allows us to quantify the likelihood of different outcomes.
The document provides an overview of key statistical concepts including variance, standard deviation, the normal distribution, frequency distributions, data matrices, hypothesis testing, and point and interval estimation. Variance and standard deviation are measures of how dispersed data points are around the mean. The normal distribution is symmetric and bell-shaped. Hypothesis testing involves specifying a null hypothesis, alternative hypothesis, test statistic, decision rule, and critical region to determine whether to reject the null hypothesis. Point and interval estimation aims to estimate population parameters from samples and provide confidence intervals.
This document provides an overview of resources for learning data handling and statistics. It recommends several books, including The Cartoon Guide to Statistics, which uses cartoons and examples to explain statistical concepts in an accessible way. It also lists topics that are important to understand, such as samples vs populations, measures of central tendency and variability, probability, and how to interpret data plots and model fits. The document emphasizes that practicing basic data analysis techniques like plotting, fitting, and interpreting results is essential on a daily basis.
Principles of Health Informatics: Informatics skills - searching and making d... (Martin Chapman)
This document discusses principles for searching for data and making decisions based on data from an informatics perspective. It covers search strategies for finding specific data within large datasets, using logical inference like deduction, abduction and induction to determine implications of new data, and using Bayes' theorem to update probabilities of outcomes when receiving new data while accounting for prior probabilities and sample sizes. Decision trees are presented as a way to combine multiple probabilities and include preferences through utilities to determine the highest utility decision.
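A minimal sketch of the decision-tree step described here, combining outcome probabilities with utilities (all numbers invented):

# Expected utility of each option: sum of probability * utility over its branches.
options = {
    "treat":       [(0.90, 80), (0.10, 20)],   # (probability, utility) pairs
    "don't treat": [(0.60, 100), (0.40, 0)],
}
expected = {name: sum(p * u for p, u in branches)
            for name, branches in options.items()}
print(expected)                         # {'treat': 74.0, "don't treat": 60.0}
print(max(expected, key=expected.get))  # the highest-utility decision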
BUS308 – Week 1 Lecture 2 Describing Data Expected Out.docx (curwenmichaela)
BUS308 – Week 1 Lecture 2
Describing Data
Expected Outcomes
After reading this lecture, the student should be familiar with:
1. Basic descriptive statistics for data location
2. Basic descriptive statistics for data consistency
3. Basic descriptive statistics for data position
4. Basic approaches for describing likelihood
5. Difference between descriptive and inferential statistics
What this lecture covers
This lecture focuses on describing data and how these descriptions can be used in an
analysis. It also introduces and defines some specific descriptive statistical tools and results.
Even if we never become data detectives or run statistical tests, we will be exposed to and bombarded with statistics and statistical outcomes. We need to understand what they are telling us and how they help uncover what the data means on the "crime," AKA the research question/issue.
How we obtain these results will be covered in lecture 1-3.
Detecting
In our favorite detective shows, starting out always seems difficult. They have a crime,
but no real clues or suspects, no idea of what happened, no “theory of the crime,” etc. Much as
we are at this point with our question on equal pay for equal work.
The process followed is remarkably similar across the different shows. First, a case or
situation presents itself. The heroes start by understanding the background of the situation and
those involved. They move on to collecting clues and following hints, some of which do not pan
out to be helpful. They then start to build relationships between and among clues and facts,
tossing out ideas that seemed good but lead to dead-ends or non-helpful insights (false leads,
etc.). Finally, a conclusion is reached and the initial question of “who done it” is solved.
Data analysis, and specifically statistical analysis, is done quite the same way as we will
see.
Descriptive Statistics
Week 1 Clues
We are interested in whether or not males and females are paid the same for doing equal
work. So, how do we go about answering this question? The “victim” in this question could be
considered the difference in pay between males and females, specifically when they are doing
equal work. An initial examination (Doc, was it murder or an accident?) involves obtaining
basic information to see if we even have cause to worry.
The first action in any analysis involves collecting the data. This generally involves
conducting a random sample from the population of employees so that we have a manageable
data set to operate from. In this case, our sample, presented in Lecture 1, gave us 25 males and
25 females spread throughout the company. A quick look at the sample by HR provided us with
assurance that the group looked representative of the company workforce we are concerned with
as a whole. Now we can confidently collect clues to see if we should be concerned or not.
As with any detective, the first issue is to understand the ...
Everything we see is distributed on some scale. Some people are tall, some are short, and some are neither tall nor short. Once we find out how many are tall, short, or of middle height, we know how people are distributed when it comes to height. This distribution can also be one of chances. For example, we throw an unbalanced die 100 times and find out how many times 1, 2, 3, 4, 5, or 6 appeared on top. This knowledge of distributions plays an important role in empirical work.
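The unbalanced-die experiment in this passage is easy to simulate; a sketch (the weights are invented):

import random
from collections import Counter

# Throw an unbalanced die 100 times and tally how often each face appears.
random.seed(42)
weights = [1, 1, 1, 1, 2, 4]   # faces 5 and 6 are favored
throws = random.choices(range(1, 7), weights=weights, k=100)
print(sorted(Counter(throws).items()))   # the empirical distribution of faces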
1) The document discusses hypothesis testing and statistical inference using examples related to coin tossing. It explains the concepts of type I and type II errors and how hypothesis tests are conducted.
2) An example is provided to test the hypothesis that the average American ideology is somewhat conservative (H0: μ = 5) using data from the National Election Study. The alternative hypothesis is that the average is less than 5 (HA: μ < 5).
3) The results of the hypothesis test show the observed test statistic is lower than the critical value, so the null hypothesis that the average is 5 is rejected in favor of the alternative that the average is less than 5.
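A sketch of the described lower-tail test from summary statistics (the sample numbers are placeholders, since the summary does not give the NES values):

import math
from scipy import stats

# One-sided z-test of H0: mu = 5 against HA: mu < 5.
mu0 = 5.0
n, xbar, s = 1500, 4.8, 1.6            # placeholder sample size, mean, std dev
z = (xbar - mu0) / (s / math.sqrt(n))  # observed test statistic
z_crit = stats.norm.ppf(0.05)          # lower-tail critical value at alpha = .05
print(f"z = {z:.2f}, critical value = {z_crit:.2f}")
if z < z_crit:
    print("Reject H0: the average is less than 5")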
Similar to "Frequentist inference only seems easy" by John Mount
This document summarizes a presentation about accelerating Apache Spark workloads using NVIDIA's RAPIDS accelerator. It notes that global data generation is expected to grow exponentially to 221 zettabytes by 2026. RAPIDS can provide significant speedups and cost savings for Spark workloads by leveraging GPUs. Benchmark results show an NVIDIA decision support benchmark running 5.7x faster at 4.5x lower cost on a GPU cluster compared to CPU. The document outlines RAPIDS integration with Spark and provides information on qualification, configuration, and future developments.
Talk at SF Big Analytics https://www.meetup.com/sf-big-analytics/events/285731741/
Distributed systems are made up of many components such as authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. These coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, ensuring service availability, name resolution, and storing other system metadata. Given their central role in the system it is essential that these systems remain available, fault tolerant and consistent. By providing a highly available file system-like abstraction as well as powerful recipes such as leader election, Apache Zookeeper is often used to implement these services. Although powerful, the Zookeeper interface may not be flexible enough or provide sufficient performance for all applications and many systems are replacing Zookeeper based solutions with Raft which provides a more generic interface to high availability and fault tolerance through the use of State Machine replication. This talk will go over a generic example of stateful coordination service moving from Zookeeper to Raft.
Speaker: Tyler Crain (Alluxio)
Tyler Crain is a software engineer at Alluxio, working on distributed systems within the Alluxio core team. Before this, Tyler held Post-Doc positions at the University of Sydney and Sorbonne Universities where he performed research on topics including distributed key-value stores, distributed consensus and blockchain. Tyler received his PhD from the University of Rennes where he worked on Transactional Memory. He also holds a Masters degree in Computer Science from University of California Santa Barbara.
talk at SF Big Analytics:
Related Blog: https://www.alluxio.io/blog/from-zookeeper-to-raft-how-alluxio-stores-file-system-state-with-high-availability-and-fault-tolerance/
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ... (Chester Chen)
Recent years have witnessed exponential growth of model scale in recommendation/Ads/search—from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for the exponential growth of model size, an efficient distributed training system is urgently needed. However, the training of such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia -- an open training system developed by my team -- to resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup properties while scaling the number of workers and the model size. Besides the capability of training 100 trillion parameters, it also shows a clear advantage in efficiency over other open-source engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science from the University of Wisconsin-Madison and his bachelor's degree in automation from the University of Science and Technology of China. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed were widely used in industry, such as at IBM and Microsoft. He left academia and joined Tencent in 2017, exploring AI's boundary. The AI agent TStarBot developed there was considered a milestone for mastering the most challenging RTS game, StarCraft II. His second stop in industry is Kwai, the second-largest short video company in China. He founded and led multiple international teams with different functionalities: platform team, product team, and research team. His team contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals and received multiple best paper awards (e.g., SIGKDD 2010 and UAI 2015 Facebook best paper). He was an awardee of MIT TR 35 under 35 in China and an IBM faculty award in 2017, and was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E... (Chester Chen)
Topic:
NVIDIA FLARE: Federated Learning Application Runtime Environment for Developing Robust AI Models
Summary:
Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without moving data. We created NVIDIA FLARE as an open-source SDK to make it easier for data scientists to use FL in their research. The SDK allows existing machine learning and deep learning workflows to be adapted for distributed learning across enterprises and enables platform developers to build a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package that allows researchers to bring their data science workflows implemented in any training library (PyTorch, TensorFlow, or even NumPy) and apply them in real-world FL settings. This talk will introduce the key design principles of NVIDIA FLARE and illustrate use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms.
Speaker: Dr. Holger Roth (NVIDIA)
Holger Roth is a Sr. Applied Research Scientist at NVIDIA focusing on deep learning for medical imaging. He has been working closely with clinicians and academics over the past several years to develop deep learning based medical image computing and computer-aided detection models for radiological applications. He is an Associate Editor for IEEE Transactions on Medical Imaging and holds a Ph.D. from University College London, UK. In 2018, he was awarded the MICCAI Young Scientist Publication Impact Award.
A missing link in the ML infrastructure stack? (Chester Chen)
Talk at SF Big Analytics
Machine learning is quickly becoming a product engineering discipline. Although several new categories of infrastructure and tools have emerged to help teams turn their models into production systems, doing so is still extremely challenging for most companies. In this talk, we survey the tooling landscape and point out several parts of the machine learning lifecycle that are still underserved. We propose a new category of tool that could help alleviate these challenges and connect the fragmented production ML tooling ecosystem. We conclude by discussing similarities and differences between our proposed system and those of a few top companies.
Bio: Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.
This document discusses the challenges of data discovery and management from the perspectives of frustrated data scientists and project managers. It explores three main problems with obtaining and working with data. While buying a solution was considered, there were also risks to mitigate. The document asks about the biggest flaw of Artifact and what is next for the company. It concludes by thanking the reader.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg... (Chester Chen)
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It's designed to ingest billions of Kafka messages every 30 minutes, and the amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and offers insights into the optimization techniques. Some key highlights are how to understand bottlenecks in Spark applications, whether to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and non-heap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, and how to amortize the cost of your application by multiplexing your jobs, along with different techniques for reducing memory footprint, runtime, and on-disk usage. The team was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
SF Big Analytics 20191112: Uncovering performance regressions in the TCP SACK... (Chester Chen)
Uncovering performance regressions in the TCP SACKs vulnerability fixes
In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX-compliant, general-purpose OS, from scratch, to run on resource-constrained devices. He got his start at Microsoft working on the Windows kernel team, porting the Windows boot environment from BIOS to UEFI.
SFBigAnalytics_20190724: Monitor Kafka like a Pro (Chester Chen)
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it. How to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, all like a Kafka pro.
Speakers: Gwen Shapira, Xavier Leaute (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
Xavier Leaute, one of the first engineers on the Confluent team, is responsible for analytics infrastructure, including real-time analytics in KafkaStreams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleChester Chen
Talk 2. Managing Uber’s Data workflow at Scale.
Uber’s microservices serve millions of rides a day, generating 100+ PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber’s journey toward a unified and scalable data workflow system used to manage this data, and shares the challenges faced and how the company has rearchitected several components of the system, such as scheduling and serialization, to make them highly available and more scalable.
Speaker: Alex Kira (Uber)
Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform. Over his 19-year career, he’s had experience across several software disciplines, including distributed systems, data infrastructure, and full-stack development.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting, storing, and managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern used to organize big data and democratize access across the organization. In this talk, we will discuss different aspects of building data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how the upserts and incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages the growth and file sizes of the resulting data lake using purely open-source file formats, while also providing optimized query performance and file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a technical lead on Uber’s data infrastructure team.
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads, and Apache Spark has evolved to solve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: key traits of Apache Spark on Kubernetes; a deep dive into Lyft’s multi-cluster setup and operations to handle petabytes of production data; how Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling; dynamic job scale estimation and runtime dynamic job configuration; and how Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
SFBigAnalytics- hybrid data management using cdapChester Chen
Cloud has emerged as a critical enabler of digital transformation, with the aim of reducing IT overheads and costs. However, cloud migration is not instantaneous, for a variety of reasons including data sensitivity, compliance, and application performance. This results in the creation of diverse hybrid and multi-cloud environments and amplifies data management and integration challenges. This talk demonstrates how CDAP’s flexibility can allow you to utilize your existing on-premises infrastructure as you evolve to the latest big data and cloud services at your own pace, all while providing a single, unified view of all your data, wherever it resides.
Speaker: Bhooshan Mogal, Google
Bhooshan Mogal is a Product Manager at Google, where he is focused on delivering best-in-class Data and Analytics services to GCP users. Prior to Google, he worked on data systems at Cask Data Inc, Pivotal and Yahoo.
Bighead: Airbnb's end-to-end machine learning platform
Airbnb has a wide variety of ML problems, ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages, and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb’s success. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python, Spark, and Kubernetes. The components include a lifecycle management service, an offline training and inference engine, an online inference service, a prototyping environment, and a Docker image customization tool. Each component can be used individually. In addition, Bighead includes a unified model building API that smoothly integrates popular libraries including TensorFlow, XGBoost, and PyTorch. Each model is reproducible and iterable through standardization of data collection and transformation, model training environments, and production deployment. This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, with a variety of models running in production, and we plan to open source it to allow the wider community to benefit from our work.
Speaker: Andrew Hoh
Andrew Hoh is the Product Manager for the ML Infrastructure and Applied ML teams at Airbnb. Previously, he has spent time building and growing Microsoft Azure's NoSQL distributed database. He holds a degree in computer science from Dartmouth College.
Sf big analytics_2018_04_18: Evolution of GoPro's data platformChester Chen
Talk 1: Evolution of GoPro's data platform
In this talk, we will share GoPro’s experiences building a data analytics cluster in the cloud. We will discuss:
- Evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive metastore + S3: cost benefits and DevOps impact
- Configurable, Spark-based batch ingestion/ETL framework
- Migration of the streaming framework to the cloud + S3
- Analytics metrics delivery with Slack integration
- BedRock: data platform management, visualization & self-service portal
- Visualizing machine learning features via Google Facets + Spark
Speakers: Chester Chen, David Winters, Hao Zou
Chester Chen is the Head of Data Science & Engineering at GoPro. Previously, he was the Director of Engineering at Alpine Data Labs.
David Winters is an architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously he worked at Apple and Splice Machine.
Hao Zou is a senior big data engineer on the Data Science and Engineering team. Previously he worked at Alpine Data Labs and Pivotal.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
GoPro’s cameras, drones, and mobile devices, as well as its web and desktop applications, generate billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing team decisions need to be distributed quickly and efficiently, and we need to visualize the metrics to find trends or anomalies.
While building up a feature store for machine learning, we need to visualize the features. Google Facets is an excellent project for visualizing features, but can we visualize larger feature datasets?
These are issues we encountered at GoPro as part of the data platform evolution. In this talk, we will discuss some of the progress we have made at GoPro. We will talk about how to use Slack + Plot.ly to deliver analytics metrics and visualizations, and we will also discuss our work to visualize large feature sets using Google Facets with Apache Spark.
Spark can be enhanced with data warehouse capabilities to leverage both open source analytics and enterprise data warehouse strengths. This includes incorporating star schema detection and referential integrity constraints to optimize queries. Performance can be improved by pushing down operations like joins, filters, and projections from Spark to underlying data sources using heuristics like star schema patterns. Pushdowns allow exploiting database indexes and reducing data transfer. Star schema detection and join pushdowns have shown speedups of 2–31x on TPC-DS benchmark queries.
This document summarizes new features in Apache Spark 2.3, including continuous processing mode for structured streaming, stream-stream joins, running Spark applications on Kubernetes, improved PySpark performance through vectorized UDFs and Pandas integration, and Databricks Delta for reliability and performance in data lakes. The author, an Apache Spark committer and PMC member, provides overviews and code examples of these features.
The document discusses building an enterprise/cloud analytics platform using Jupyter notebooks and Apache Spark. It describes the challenges of deploying Jupyter notebooks at an enterprise scale, including collaboration, large-scale data analysis, security, and authentication. It outlines various approaches taken to address these challenges, such as running the entire Jupyter stack on a single large machine or giving each user their own container. However, these approaches have limitations. The document then introduces the Jupyter Enterprise Gateway as a solution developed by IBM to optimize resource allocation, support multi-users securely through impersonation, and enhance security overall when deploying Jupyter at an enterprise scale.
The document summarizes new features and improvements in Apache Spark 2.3 for machine learning. Key highlights include first-class support for loading image data, enhanced scalability of feature transformers by supporting multiple columns, parallelizing cross-validation for faster hyperparameter tuning, and a new scalable feature hashing transformer. Performance tests demonstrate that the multi-column transformers provide up to 2.7x speedup over the single-column approach. Parallel cross-validation also provides a 2-2.7x speedup using 3 threads. Future areas of focus include completing multi-column support, improving Python APIs, and enhancing techniques like gradient boosted trees.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill models to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using current generative AI industry trends.
Frequentist inference only seems easy By John Mount
1. Frequentist estimation only seems easy
John Mount
Win-Vector LLC
Outline
First example problem: estimating the success rate of coin flips.
Second example problem: estimating the success rate of a dice game.
Interspersed in both: an entomologist’s view of lots of heavy calculation.
Image from “HOW TO PIN AND LABEL ADULT INSECTS,” Bambara, Blinn, http://www.ces.ncsu.edu/depts/ent/notes/4H/insect_pinning4a.html
This talk is going to alternate between simple probability games (like rolling dice) and the detailed calculations needed to bring the reasoning forward. If you come away with two points from this talk, remember: classic frequentist statistics is not as cut and dried as teachers claim (so it is okay to ask questions), and Bayesian statistics is not nearly as complicated as people make it appear.
The point of this talk
Statistics is a polished field where many of the foundations are no longer discussed.
A lot of the “math anxiety” felt in learning statistics is from uncertainty about these foundations, and how they actually lead to common practices.
We are going to discuss common simple statistical goals (correct models, unbiasedness, low error) and how they lead to common simple statistical procedures.
The surprises (at least for me) are:
There is more than one way to do things.
The calculations needed to justify how even simple procedures are derived from the goals are in fact pretty involved.
A lot of the pain of learning is being told there is only “one way” (when there is more than one) and that a hard step (linking goals to procedures) is easy (when in fact it is hard). Statistics would be easier to teach if those two things were true, but they are not. However, not addressing these issues makes learning statistics harder than it has to be. We are going to spend some time on what are appropriate statistical goals, and how they lead to common statistical procedures (instead of claiming everything is obvious). You won’t be expected to invent the math, but you need to accept that it is in fact hard to justify common statistical procedures without somebody having already done the math. And I’ll be honest: I am a math-for-math’s-sake guy.
2. What you will get from this presentation
Simple puzzles that present problems for the common rules of estimating rates. Good for countering somebody who says “everything is easy and you just don’t get it.”
Examples that expose strong consequences of the seemingly subtle differences in common statistical estimation methods. Makes understanding seemingly esoteric distinctions like Bayesianism and frequentism much easier.
A taste of some of the really neat math used to establish common statistics.
A revival of Wald game-theoretic style inference (as described in Savage, “The Foundations of Statistics”).
You will get to roll the die, and we won’t make you do the heavy math. Aside: we have been telling people that one of the things that makes data science easy is that large data sets allow you to avoid some of the hard math in small-sample-size problems. Here we work through some of the math. In practice you do get small-sample-size issues even in large data sets, due to heavy-tail-like phenomena and when you introduce conditioning and segmentation (themselves typical modeling steps).
First example: coin flip game
Why do we even care?
The coin problem is a stand-in for something that is probably important to us: such as estimating the probability of a sale given features and past experience: P[ sale | features, evidence ].
Being able to efficiently form good estimates that combine domain knowledge, current features, and past data is the ultimate goal of analytics/data science.
3. The coin problem
You are watching flips of a coin and want to estimate the probability p that the coin comes up heads.
For example: "T" "T" "H" "T" "H" "T" "H" "H" "T" “T"
Easy to apply!
Sufficient statistic: 4 heads, 6 tails
Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4
Done. Thanks for your time.
# R code
set.seed(2014)
sample = rbinom(10,1,0.5)
print(ifelse(sample>0.5,'H','T'))
Wait, how did we know to do that?
Why is it obvious h/(h+t) is the best estimate of the unknown true value of p?
Fundamental problem: a mid-range probability prediction (say a number in the range 1/6 to 5/6) is not falsifiable by a single experiment. So: how do we know such statements actually have empirical content? The usual answers are performance on long sequences (frequentist), appeals to axioms of probability (essentially additivity of disjoint events), and subjective interpretations. Each view has some assumptions and takes some work.
The standard easy estimate comes from frequentism
The standard answer (this example from http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair ):
Estimator of true probability: the best estimator for the actual value is the estimator p = h/(h+t), and this estimator has a margin of error (E) at a particular confidence level.
[Slide shows a screenshot of the Wikipedia article “Checking whether a coin is fair”: a plot of the posterior probability density f(x | H = 7, T = 3) = 1320 x^7 (1 - x)^3 for x from 0 to 1, with discussion noting that under the uniform prior the posterior achieves its peak at r = h/(h+t) = 0.7, the maximum a posteriori (MAP) estimate of r; that with more trials the choice of prior distribution becomes less relevant; and that to decide the number of times the coin should be tossed two parameters are required: the confidence level, given by the Z-value of a standard normal distribution, and the maximum acceptable error E.]
Answer is correct and simple, but not good (as it lacks context, assumptions, goals, motivation, and explanation).
Stumper: without an appeal to authority, how do we know to use the estimate heads/(heads+tails)? What problem is such an estimate solving (what criterion is it optimizing)?
Notation on the Wikipedia page is a bit different: there tau is the unknown true value and p is the estimate. Throughout this talk by “coin” we mean an abstract device that always returns one of two states. Gelman and Nolan have an interesting article, “You Can Load a Die, But You Can’t Bias a Coin” http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf , about how hard it would be to bias an actual coin that you allow somebody else to flip (and how useless articles testing the fairness of the new Euro were).
4. Also, there are other common estimates
Examples:
A priori belief: p ~ 0.5 regardless of evidence.
Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails+1) = 0.4090909
Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) = 0.4166667
Game theory minimax estimates (more on this later in this talk).
The classic frequentist estimate is not the only acceptable estimate.
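These alternatives are easy to reproduce (a minimal sketch in R, using the 4 heads and 6 tails from the earlier sample):
# R code: competing point estimates for 4 heads, 6 tails
heads <- 4; tails <- 6
heads/(heads + tails)              # frequentist: 0.4
(heads + 0.5)/(heads + tails + 1)  # Jeffreys prior: 0.4090909
(heads + 1)/(heads + tails + 2)    # Laplace smoothed: 0.4166667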
Each of these has its merits. The a priori belief has the least sampling noise (as it ignores the data). The Bayesian estimate with Jeffreys prior very roughly tries to maximize the amount of information captured in the first observation. Laplace smoothing minimizes expected square error under a uniform prior.
Each different estimate has its own characteristic justification
From “The Cartoon Guide to Statistics,” Gonick and Smith.
If all of the estimates were “fully compatible” with each other then they would all be identical, which they clearly are not. Notice we are discussing differences in estimates here, not differences in significances or hypothesis tests. Also, Bayesian priors are not always subjective beliefs (Wald in particular used an operational definition).
The standard story
There are 1 to 2 ways to do statistics: frequentism and maybe Bayesianism.
In frequentist estimation the unknown quantity to be estimated is fixed at a single value and the experiment is considered a repeatable event (with different possible measurements on each repetition). All probabilities are over possible repetitions of the experiment, with observations changing.
In Bayesian estimation the unknown quantity to be estimated is assumed to have a non-trivial distribution and the experimental results are considered fixed. All probabilities are over possible values of the quantity to be estimated. Priors talk about the assumed distribution before measurement; posteriors talk about the distribution conditioned on the measurements.
There are other differences, such as a preference for point-wise estimates versus full descriptions of distributions. And these are not the only possible models.
5. Our coin example again
I flip a coin a single time and it comes up heads. What is my best estimate of the probability the coin comes up heads in repeated flips?
“Classic”/naive probability: 0.5 (independent of observations/data)
Frequentist: 1.0
Bayesian (Jeffreys prior): 0.75
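As a quick check of the single-flip numbers (a minimal sketch in R):
# R code: estimates after observing a single head (h = 1, t = 0)
h <- 1; t <- 0
h/(h + t)              # frequentist: 1.0
(h + 0.5)/(h + t + 1)  # Jeffreys prior: 0.75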
Laws that are correct are correct in the extreme cases. (If we have distributed 6-sided dice:) Let’s try this. Everybody roll your die. If it comes up odd you win; if even you lose. Okay, somebody who won, raise your hand. Each one of you, if purely frequentist, estimates a 100% chance of winning this game (if you stick only to data from your die). Now please put your hands down. Everybody who did not win: how do you feel about the estimate of a 100% chance of winning?
What is the frequentist estimate optimizing?
"Bayesian Data Analysis" 3rd Edition,Gelman, Carlin, Stern,
Dunson, Vehtari, Rubin p. 92 states that frequentist estimates are
designed to be consistent (as the sample size increases they
converge to the unknown value), efficient (they tend to minimize
loss or expected square-error), or even have asymptotic
unbiasedness (the difference in the estimate from the true value
converges to zero as the experiment size increases, even when re-scaled
by the shrinking standard error of the estimate).
If we think about it: frequentism is interpreting probabilities as limits
of rates of repeated experiments. In this form bias is an especially
bad form of error as it doesn’t average out.
Why not minimize L1 error? Because this doesn’t always turn out to be unbiased (or isn’t always a regression). Bayesians can allow bias. The saving idea is: don’t average estimators, but aggregate data and form a new estimate.
Frequentist concerns: bias and efficiency (variance)
From “The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt.
Bias: an estimator for which $E[\hat{\theta}] = \theta$ is said to be unbiased.
Efficiency: a term applied in the context of comparing different methods of estimating the same parameter; the estimate with the lowest variance being regarded as the most efficient.
There is more than one unbiased estimate. For example, a grand average (unconditioned by features) is an unbiased estimate.
6. A good motivation of the frequentist estimate
Adapted from “Schaum’s Outline of Statistics” 4th Edition, Spiegel, Stephens, pp. 204-205.
SAMPLING DISTRIBUTIONS OF MEANS
Suppose that all possible samples of size N are drawn without replacement from a finite population of size Np > N. If we denote the mean and standard deviation of the sampling distribution of means by $E[\hat{\mu}]$ and $\sigma_{\hat{\mu}}$, and the population mean and standard deviation by $\mu$ and $\sigma$ respectively, then
$E[\hat{\mu}] = \mu \quad \text{and} \quad \sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{N}} \sqrt{\frac{N_p - N}{N_p - 1}}$   (1)
If the population is infinite or if sampling is with replacement, the above results reduce to
$E[\hat{\mu}] = \mu \quad \text{and} \quad \sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{N}}$   (2)
SAMPLING DISTRIBUTIONS OF PROPORTIONS
Suppose that a population is infinite and the probability of occurrence of an event (called its success) is p. ... We thus obtain a sampling distribution of proportions whose mean $E[\hat{p}]$ and standard deviation $\sigma_{\hat{p}}$ are given by
$E[\hat{p}] = p \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{N}}$   (3)
A very good explanation. Unbiased views of the unknown parameter and its variance are directly observable in the sampling distribution, so you copy the observed values as your estimates. But to our point: frequentism no longer seems so simple. Also close to the Bayesian justification: build a complete generative model with complete priors, and then you can copy averages of what you observe.
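A simulation sketch of identity (3) in R (parameters chosen for illustration):
# R code: simulate the sampling distribution of a proportion
set.seed(2014)
p <- 1/6; N <- 50
phat <- replicate(10000, mean(rbinom(N, 1, p)))
mean(phat)         # close to p = 0.1666667
sd(phat)           # close to the theoretical value below
sqrt(p*(1 - p)/N)  # 0.05270463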
Why is the frequentist forced to use the estimates 0 and 1?
If the frequentist estimate is to be unbiased for any unknown value of p (in the range 0 through 1) then we must have, for each such p:
$\sum_{h=0}^{n} \text{P}[h \mid n, p] \, e_{n,h} = \sum_{h=0}^{n} \binom{n}{h} p^h (1-p)^{n-h} \, e_{n,h} = p$
The frequentist estimate for each possible outcome of seeing h heads in n flips is a simultaneously planned panel of estimates e(n,h) that must satisfy the above bias-check equations for all p.
These check conditions tend to be independent linear equations over our planned estimates e(n,h). So the system has at most one solution, and it turns out the solution e(n,h) = h/n works.
Insisting on unbiasedness completely determines the solution.
Estimates like 0 and 1 are wasteful in the sense that they allow only one-sided errors. Laplace “add one” smoothing puts estimates between likely values (lowering expected l2 error under uniform priors).
The check equations tend to be full-rank linear equations in e(n,h), as the p’s generate something very much like the moment curve (which itself is a parameterized curve generating sets of points in general position).
The reason I am showing this is: usually frequentist inference is described as canned procedures (avoiding triggering math anxiety) and Bayesian methods are presented as complicated formulas. In fact you should be as uncomfortable with frequentist methods as you are with Bayesian methods.
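One can verify the claim numerically (a minimal sketch for n = 2: three distinct p values give a full-rank linear system, and its unique solution is h/n):
# R code: solve the bias-check equations for n = 2
n <- 2
ps <- c(0.2, 0.5, 0.8)                             # any three distinct p values
A <- t(sapply(ps, function(p) dbinom(0:n, n, p)))  # P[h | n, p] coefficients
solve(A, ps)                                       # 0.0 0.5 1.0, i.e. e(n,h) = h/n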
Argh! That is a lot of painful math.
The math (turning reasonable desiderata into reasonable procedures) has always been hiding there.
You never need to re-do the math to use the classic frequentist inference procedures (just to derive them).
We really worked to get h/(h+t) the hard way. The frequentist can’t generate an estimate for a single outcome; they must submit a panel of estimates for every possible outcome and then check that the panel represents a schedule of estimates that are simultaneously unbiased for any possible p.
7. Is the frequentist solution optimal?
It is the only unbiased solution. So it is certainly the most efficient unbiased solution.
What if we relaxed unbiasedness? Are there more efficient solutions?
Yes: consider estimates e(1,h) = (0,1) and b(1,h) = (1/4,3/4).
Suppose loss is: loss(f,n) = E[ E[(f(n,h)-p)^2 | h ~ p,n] | p ~ P[p] ]
P[p] is an assumed prior probability on p, such as P[p] = 1/3 if p = 0, 1/2, 1 and 0 otherwise.
Then: loss(1,b) = 0.0625 and loss(1,e) = 1/12 ≈ 0.0833. Since loss(1,b) < loss(1,e), you can think of the Bayesian procedures as being more efficient.
But that isn’t fair. Insisting on a prior is adopting the Bayesian’s assumptions as truth. Of course that makes them look better.
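These loss numbers can be checked directly (a minimal sketch; panel_loss is a helper named here for illustration):
# R code: prior-weighted expected square loss of a 1-flip estimate panel
panel_loss <- function(f, ps, prior) {
  per_p <- sapply(ps, function(p) sum(dbinom(0:1, 1, p)*(f - p)^2))
  sum(prior*per_p)
}
ps <- c(0, 0.5, 1); prior <- rep(1/3, 3)
panel_loss(c(0.25, 0.75), ps, prior)  # 0.0625, constant in p
panel_loss(c(0, 1), ps, prior)        # 0.08333333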
Frequentist response: you can’t just wish away bias
Let’s try this lower-loss Bayesian estimate b(1,h) = (0.25,0.75).
Suppose we have 50 dice and we record wins as 1, losses as 0.
Suppose in the above experiment there were 50 of us and 8 people won.
Averaging the frequentist estimates: (8*1.0 + 42*0.0)/50 = 0.16 (not too far from the true value 1/6 = 0.1666667).
Averaging the “improved” Bayesian estimates: (8*0.75 + 42*0.25)/50 = 0.33. Way off, and most of the error is bias (not mere sampling error).
Bayesian response: you don’t average estimates, you aggregate data and re-estimate. So you treat the group as a single experiment with 8 wins and 42 losses. The estimate is then (8+0.5)/(50+1) = 0.1666667 (no reason for the estimate to be dead on; the Bayesians got lucky that time).
(If they have dice they can run with this: all roll, count, and compute.) Bayesian response: you don’t average individual estimators, you collect the data and re-estimate:
# R code
set.seed(2014)
sample = rbinom(50,1,1/6)
sum(sample)/length(sample)
[1] 0.16
sum(ifelse(sample>0.5,0.75,0.25))/length(sample)
[1] 0.33
(0.5+sum(sample))/(1+length(sample))
[1] 0.1666667
Second example: dice game
8. Dice are a fun example
Dice are pretty much designed to obey the axioms of naive/classical probability theory (indivisible events having equal probability). Also, once you have a lot of dice it is easy to think in terms of exchangeable repetitions of experiments (frequentist). Given that, you will forgive us if we tilt the game towards the Bayesians by adding some hidden state.
The dice game
A control die numbered 1 through 5 is either rolled or placed on one of its sides.
The game die is a fair die numbered 1 through 6. When the game die is rolled, the game is a win if the number shown on the game die is greater than the number shown on the control die.
The control die is held at the same value even when we re-roll the game die.
Neither of the dice is ever seen by the player.
You only see the win/lose state, not the control die or the game die
(If we have distributed 6-sided dice:) Let’s play a round of this. I’ll hold the control die at 3. You all roll your 6-sided die. Okay, everybody whose die exceeded 3, raise your hand. This time we will group our observations to estimate the “unknown” probability p of winning. What we are looking for is that close to half the room (assuming we have enough people to build a large sample, and that we don’t get incredibly unlucky) have raised their hands. From this you should be able to surmise there are good odds the control die is set at 3, even if you don’t remember what you saw on the control die or what was on your game die.
9. Multiple plays
The control die is held at a single value and you try to learn the odds by observing the wins/losses reported by repeated rolls of the game die (but not ever seeing either of the dice).
The empirical frequentist procedure seems off
After the first roll you are forced (by the bias-check conditions) to estimate a win rate of 0 or 1. The win rate is always one of 1/6, 2/6, 3/6, 4/6, or 5/6. So your first estimate is always out of range.
After 5 rolls the bias equations no longer determine a unique solution. So you can try to decrease variance without adding any bias. But since your solution is no longer unique, you should have less faith it is the one true solution.
We could try Winsorizing: using 1/6 as our estimate if we lose and 5/6 as our estimate if we win. But we saw earlier that “tucking in” estimates doesn’t always help (it introduces a bad bias).
How about other estimates?
Can we find an estimator that uses criteria other than unbiasedness, without the strong assumption of knowing a favorable prior distribution?
Remember: if we assume a prior distribution (even a so-called uninformative prior) and the assumption turns out to be very far off, then our estimate may be very far off (at least until we have enough data to dominate the priors).
How about a solution that does well for the worst possible selection of the unknown probability p?
We are not assuming a distribution on p, just that it is picked to be worst possible for our strategy.
10. Leads us to a game theory minimax solution
We want an estimate f(n,h) such that:
$f(n,h) = \text{argmin}_{f(n,h)} \max_{p \in \mathbb{R}^{5},\, p \ge 0,\, 1 \cdot p = 1} \sum_{c=1}^{5} p_c \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p_c] \times \text{loss}\left( f(n,k), \frac{6-c}{6} \right)$
where loss(u,v) = (u-v)^2 or loss(u,v) = |u-v|. Here the opponent is submitting a vector p of probabilities of setting the control die to each of its 5 marks. The standard game-theory way to solve this is to find an f(n,h) that works well against the opponent picking a single state of the control die (c) after they see our complete set of estimates. That is:
$f(n,h) = \text{argmin}_{f(n,h)} \max_{c \in \{1, \cdots, 5\}} \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p_c] \times \text{loss}\left( f(n,k), \frac{6-c}{6} \right)$
In practice we would just use Bayesian methods with reasonable priors. The reduction of one very hard form to another slightly less-hard problem is the core theorem of game theory. Even if you have been taught not to fear long equations, these should look nasty (as they have a lot of quantifiers in them, and quantifiers can rapidly increase complexity). f(n,h) is just a panel or vector of n+1 estimate choices for each n. Also, once you have things down to a simple minimization you essentially have a problem of designing numerical integration or optimal quadrature.
Wald already systematized this
If you believe the control die is set by a fair roll, then we again have a game designed to exactly match a specific generative model (i.e., designed for Bayesian methods to win). If you believe the die is set by an adversary, you again have a game theory problem. Player 1 is trying to maximize risk/loss/error and player 2 is trying to minimize risk. We model the game as both players submitting their strategies at the same time. The standard game theory solution is you pick a strategy so strong that you would do no worse if your opponent peeked at it and then altered their strategy. This is part of a minimax setup.
Wald, A. (1949). Statistical Decision Functions. Ann. Math. Statist., 20(2):165–205.
Wald was very smart
One of his WWII ideas: armor sections of combat planes that you never saw damaged on returning planes. Classical thinking: put armor where you see bullet holes. Wald: put armor where you have never seen a bullet hole (hence never seen a hit survived).
Wald could bring a lot of deep math to the table. Wald’s solution allows for many different choices of loss (not just variance or L2) and for probabilistic estimates (i.e., you don’t have to return the same estimate every time you see the same evidence, though that isn’t really an advantage).
11. Our game
In both cases the loss function is convex, so we expect a unique connected set of globally optimal solutions (no isolated local minima).
For the l1-loss case where loss(u,v) = |u-v| we can solve for the optimal f(n,k) by a linear program.
1-round l1 solution: [0.3, 0.7]
2-round l1 solution: [0.24, 0.5, 0.76]
For the l2-loss case where loss(u,v) = (u-v)^2 we can solve for the optimal f(n,k) using Newton’s method.
1-round l2 solution: [0.25, 0.75]
2-round l2 solution: [0.21, 0.5, 0.79]
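As a numerical check of the 1-round l1 solution, here is a crude grid search (a sketch; a linear program is the exact approach the slide describes):
# R code: grid-search the 1-round l1 minimax estimate for the dice game
p <- (6 - (1:5))/6                 # win probabilities for control die = 1..5
grid <- seq(0, 1, by = 0.01)
best <- c(NA, NA); best_loss <- Inf
for (f0 in grid) for (f1 in grid) {
  # expected l1 loss against each possible control-die setting
  loss <- (1 - p)*abs(f0 - p) + p*abs(f1 - p)
  if (max(loss) < best_loss) { best_loss <- max(loss); best <- c(f0, f1) }
}
best       # 0.30 0.70
best_loss  # 0.2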
These solutions are profitably exploiting both the boundedness of p (in the range 1/6 through 5/6) and the fact that p only takes one of 5 possible values (though we obviously don’t know which).
How do we pick between l1 and l2 loss? l2 is traditional, as it is the next natural moment after the first moment (which becomes the bias conditions). Without the bias conditions, l1 loss is plausible (and leads to things like quantile regression). l2 has some advantages (such as the gradient structure tending to get expectations right, hence helping enforce regression conditions and reduce bias).
Another game
Suppose the opponent can pick any probability for a coin (they are not limited to 1/6, 2/6, 3/6, 4/6, 5/6).
In this case we want to pick f(n,h) minimizing:
$M(n, f(n,h)) = \max_{p \in [0,1]} \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p] \times \text{loss}( f(n,k), p )$
The general p l2 minimax solutions
For the l1-loss case where loss(u,v) = |u-v| we have a convex program with a different linear constraint for each possible p. A column-generating strategy over an LP solver handles this quite nicely.
For the l2-loss case where loss(u,v) = (u-v)^2 the solution is:
$\frac{\text{heads} + \sqrt{\text{heads} + \text{tails}}/2}{\text{heads} + \text{tails} + \sqrt{\text{heads} + \text{tails}}}$
Savage, L. J. (1972). The Foundations of Statistics. Dover, cites this solution as coming from Hodges, J. L., Jr. and Lehmann, E. L. (1950). Some problems in minimax point estimation. The Annals of Mathematical Statistics, 21(2):182–197.
See http://winvector.github.io/freq/minimax.pdf for details.
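In R the closed form is one line (a minimal sketch; l2_minimax is a helper named here for illustration):
# R code: Hodges-Lehmann l2 minimax estimate, i.e. add sqrt(n)/2
# pseudo-heads and sqrt(n)/2 pseudo-tails to the observed counts
l2_minimax <- function(heads, tails) {
  n <- heads + tails
  (heads + sqrt(n)/2)/(n + sqrt(n))
}
l2_minimax(0:1, 1:0)  # 0.25 0.75, the 1-round solution
l2_minimax(0:5, 5:0)  # matches the n = 5 row of the l2 minimax table shown later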
12. How can you solve the l2 minimax problem?
Define:
$L(n, f(n,h), p) = \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p] \times (f(n,k) - p)^2$
For every n there is an f(n,h) (essentially a table of n+1 estimates) such that L(n,f(n,h),p) = g(n), where g(n) is free of p. And further: the partial derivatives of L(n,·,·) with respect to any of the entries of f(n,h) evaluated at this f(n,h) are not p-free; in fact there are always p’s that allow us to freely choose the sign of this gradient. Enough to claim:
$\text{argmin}_{f(n,h)} \max_p L(n, f(n,h), p) = \text{root}_{f(n,h)}\left( L(n, f(n,h), p) - f(n,0)^2 \right)$
Examples:
L(1, (1/4, 3/4), p) = 1/16
L(2, (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2), p) = -sqrt(2)/2 + 3/4
We know L(n,f(n,h),p) is convex in f(n,h), so max_p L(n,f(n,h),p) is also convex in f(n,h). We are not looking at the usual Karush–Kuhn–Tucker conditions of optimality. What I think is going on is that M(n,f(n,h)) = max_p L(n,f(n,h),p) is majorized by L(·,·,·), so we are collecting evidence of the optimal point through p. What is exciting is we get rid of the quantifiers, making the problem much easier.
See http://winvector.github.io/freq/explicitSolution.html and https://github.com/WinVector/Examples/blob/master/freq/python/explicitSolution.rst for more details.
The l2 minimax solution in a graph
Solution of the form L(1, (lambda, 1-lambda), p).
Notice the best minimax solution is at f(1,h) = (0.25, 0.75); all the p-curves cross there. Also notice that if you move from 0.25, you can always find a p that makes things worse. This proves the solution is a local minimum, so by convexity it is also the global optimum.
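We can check the p-free property of the minimax panel directly (a minimal sketch in R):
# R code: L(1, (f0, f1), p) for the 1-round game
L1 <- function(f0, f1, p) (1 - p)*(f0 - p)^2 + p*(f1 - p)^2
p <- seq(0, 1, by = 0.1)
L1(0.25, 0.75, p)  # constant 0.0625 = 1/16 for every p
L1(0.20, 0.80, p)  # varies with p; some p does worse than 1/16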
So it is just a matter of checking that the stated solution clears the p’s out of L(n,·,p). Leonard J. Savage gives this example on page 203 of the 1972 edition of “The Foundations of Statistics.” He attributes it to: “Some Problems in Minimax Point Estimation,” J. L. Hodges and E. L. Lehmann, The Annals of Mathematical Statistics, 1950, vol. 21(2), pp. 182–197.
A few exact l1/l2 solutions
1-round l2 solution: (1/4, 3/4) (also the 1-round l1 solution)
2-round l2 solution: (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2) ~ (0.207, 0.5, 0.793)
Not the same as the 2-round l1 solution: (0.192, 0.5, 0.808)
Again, this game is to build a best l1 or l2 estimate for any p in the range 0 through 1. Each estimate is biased (as they don’t agree with the traditional empirical frequentist estimate), but the bias goes down as n goes up. Also, these estimates are not the traditional Bayesian ones, as they don’t agree with anything coming from traditional priors (notice the non-rational values). These are related to what Wald called “logical Bayes,” where the Bayesian machinery is used but we don’t insist on priors (instead we solve a minimax problem, where we try to do well under the worst possible initial distribution).
13. Table of estimates
[Figure: for each of the four estimators, the chosen estimate phi for h heads out of n flips, plotted for n = 1 through 10, with each point labeled h/n. Legend (estName): Bayes (Jeffreys), Frequentist, l1 minimax, l2 minimax; x-axis: n; y-axis: phi from 0 to 1.]
For each of the four major estimates we discussed, we show the chosen estimate phi for h heads out of n flips. In general the frequentist estimate is outermost, then Bayes (Jeffreys), then l1 minimax, then l2 minimax; the l1 and l2 interior solutions are very close. This is a graph of a ready-to-go decision table (a user could forget everything up until here and just pick their phis off the graph). Notice the frequentist solution crosses l2 minimax around n=8. Also, all solutions except l1 minimax are equally spaced when n is held fixed. For more details see: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst
Or: consider this table no easier to use …
Frequentist estimate, h heads in n flips:
n\h         0         1         2         3         4         5
1   0.0000000 1.0000000
2   0.0000000 0.5000000 1.0000000
3   0.0000000 0.3333333 0.6666667 1.0000000
4   0.0000000 0.2500000 0.5000000 0.7500000 1.0000000
5   0.0000000 0.2000000 0.4000000 0.6000000 0.8000000 1.0000000
Obviously you don’t need the table for the frequentist estimate, as h/(h+t) is easy to remember.
than to use:
l2 minimax estimate, h heads in n flips:
n\h         0         1         2         3         4         5
1   0.2500000 0.7500000
2   0.2071068 0.5000000 0.7928932
3   0.1830127 0.3943376 0.6056624 0.8169873
4   0.1666667 0.3333333 0.5000000 0.6666667 0.8333333
5   0.1545085 0.2927051 0.4309017 0.5690983 0.7072949 0.8454915
And the point is: depending on your goals, this table might be the one you want. However, be warned: the l2 minimax procedure of adding sqrt(n) pseudo-observations is an uncommon procedure; you want to check whether you really want that.
14. And that is it
What to take away
Deriving or justifying optimal inference techniques on even simple dice games can bring in a lot of heavy calculation. If you don’t find that worrying, then you aren’t paying attention.
For standard situations statisticians did the heavy calculations a long time ago and packaged up good and simple procedures (the justifications are difficult, but you don’t have to repeat the justifications each time you apply the methods).
Unbiasedness is just one desirable property among many. If you accept it as required, you are often forced to accept the traditional empirical frequentist estimates as the only possible and best possible ones (not always a good thing).
Differences in Bayesian and frequentist assumptions lead not only to different hypothesis testing paradigms (confidence intervals versus credible intervals); they also pick different “optimal” estimates. The best answer depends on your use case (not your sense of style).
Thank you
15. Links
iPython notebook of most of these results/graphs:
https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst
More on this topic:
http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/
http://www.win-vector.com/blog/2014/07/automatic-bias-correction-doesnt-fix-omitted-variable-bias/
For more information please try our blog:
http://www.win-vector.com/blog/
and our book “Practical Data Science with R”
http://practicaldatascience.com .
Please contact us with comments, questions, ideas, and projects at:
jmount@win-vector.com