Statistics for Non-Statisticians
Gerald Belton
Statistics:
A science of collection,
presentation, analysis,
and interpretation of
numerical data.
The science of statistics is based on
probability.
Discrete distributions describe data that can
only take specific values.
A coin toss is an example of a Bernoulli
distribution.
[Figure: Bernoulli Distribution. Bar chart of Probability for Heads and Tails.]
A Binomial Distribution results from multiple
coin tosses.
[Figure: Tossing a coin 10 times. Probability by number of heads, 0 through 10: 0.001, 0.0098, 0.0439, 0.1172, 0.2051, 0.2461, 0.2051, 0.1172, 0.0439, 0.0098, 0.001.]
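The bar heights in this chart can be reproduced from the binomial formula. The following is a minimal sketch, assuming a fair coin (p = 0.5); the function name `binomial_pmf` is illustrative and does not come from the slides.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of each possible number of heads in 10 tosses of a fair coin.
probs = [binomial_pmf(k, 10, 0.5) for k in range(11)]
for k, pr in enumerate(probs):
    print(f"{k:2d} heads: {pr:.4f}")
```

Five heads comes out at about 0.2461, the tallest bar in the chart, and the eleven probabilities add up to 1.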
Rolling one die can be described with a
Uniform distribution.
[Figure: Rolling one die. Probability by number rolled, 1 through 6: 0.167 each.]
[Figure: Rolling two dice. Probability by number rolled, 2 through 12: 0.028, 0.056, 0.083, 0.111, 0.139, 0.167, 0.139, 0.111, 0.083, 0.056, 0.028.]
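The two-dice probabilities follow from counting outcomes. As a rough sketch, assuming two fair six-sided dice, we can list all 36 equally likely pairs and tally the totals; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (die1, die2) outcomes and count each total.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
two_dice = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, pr in two_dice.items():
    print(f"{total:2d}: {float(pr):.3f}")
```

There are 11 possible totals (2 through 12). A total of 7 can be made six ways, so its probability is 6/36, while 2 and 12 can each be made only one way.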
Continuous distributions describe data that
can take infinitely many values.
Rainfall amounts follow an exponential
distribution.
The Normal Distribution is a very special
continuous distribution.
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
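The formula above translates directly into code. This is a small sketch, with `normal_pdf` as an illustrative name and the standard normal (mean 0, standard deviation 1) assumed as the default.

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal distribution with mean mu and std dev sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

# The curve peaks at the mean and is symmetric around it.
print(normal_pdf(0.0))
print(normal_pdf(1.0) == normal_pdf(-1.0))
```

The value is a density, not a probability: probabilities come from the area under the curve between two points.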
Lots of real-world measures are “sort of”
normally distributed.
Here’s an idealized normal distribution.
[Figure: normal curve with σ marked from the mean; shaded areas labeled 68%, 95%, and 99.7%.]
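The 68%, 95%, and 99.7% figures can be checked numerically. A short sketch using the error function from Python's standard library; `within_k_sigma` is an illustrative name.

```python
from math import erf, sqrt

def within_k_sigma(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {within_k_sigma(k):.4f}")
```

The results are approximately 0.6827, 0.9545, and 0.9973, which is where the rounded percentages on the slide come from.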
Central Limit Theorem makes other
distributions “act normal.”
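A quick simulation illustrates the idea. This sketch assumes a fair die (a uniform, decidedly non-normal distribution) and looks at the means of repeated samples; the sample size of 30 and the 2,000 repetitions are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n: int) -> float:
    """Mean of n rolls of a fair six-sided die."""
    return mean(random.randint(1, 6) for _ in range(n))

# Draw 2,000 repeated samples of 30 rolls each and collect their means.
means = [sample_mean(30) for _ in range(2000)]
print(f"average of sample means: {mean(means):.2f}")
print(f"std dev of sample means: {stdev(means):.2f}")
```

The sample means cluster around 3.5, the mean of a single roll, and a histogram of `means` would look roughly bell-shaped even though a single roll is uniform.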
Descriptive statistics tell us about the world.
Visualizations quickly convey information.
Census Map of NC
Florence Nightingale
Minard’s Map
Numerical descriptions provide more detail.
Location: Mean, Median, Mode
Spread: Variance, Std Dev
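The village example from the speaker notes (100 workers paid $10 per year, one owner paid $1,000,000) makes a handy test case for these measures. A minimal sketch; the variance here divides by the number of data points, as the notes describe.

```python
from statistics import mean, median, mode

# Village from the speaker notes: 100 workers at $10/year, one owner at $1,000,000/year.
wages = [10] * 100 + [1_000_000]

print(f"mean:   {mean(wages):,.0f}")
print(f"median: {median(wages)}")
print(f"mode:   {mode(wages)}")

# Variance: the average squared distance from the mean.
mu = mean(wages)
variance = sum((w - mu) ** 2 for w in wages) / len(wages)
std_dev = variance ** 0.5
print(f"std dev: {std_dev:,.0f}")
```

The single outlier pulls the mean to about $9,911 while the median and mode stay at $10.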
Five number summary
> summary(GaltonFathers$father)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  62.00   68.00   69.50   69.32   71.00   78.50
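The same kind of summary can be sketched in Python. The heights below are made-up values for illustration, not Galton's data, and `method="inclusive"` is chosen because it should correspond to the interpolation R uses by default. The sketch also computes box-plot fences, placed 1.5 interquartile ranges beyond the quartiles.

```python
from statistics import quantiles

# Hypothetical heights in inches (made-up values, not Galton's data).
heights = [62.0, 66.5, 68.0, 69.0, 69.5, 70.0, 71.0, 72.5, 78.5]

# Quartiles by linear interpolation between data points.
q1, med, q3 = quantiles(heights, n=4, method="inclusive")
five_number = (min(heights), q1, med, q3, max(heights))

# Box-plot fences sit 1.5 interquartile ranges beyond the quartiles.
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [h for h in heights if h < lower_fence or h > upper_fence]

print(five_number)
print(outliers)
```

With these values the fences fall at 63.5 and 75.5, so the shortest and tallest heights are flagged as outliers.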
We have tools for looking at the relationship
between variables.
Correlation
Not Causation!
Spurious Correlation Example
Statistical Inference uses properties of a
sample to explain a population.
[Diagram: Population (described by Parameters) → Sampling Technique → Sample (described by Statistics) → Inference back to the Population.]
Sampling is extremely important.
Online Survey Example
Simple Random Sample vs.
Stratified Random Sample
Sample Size vs Precision
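The slide's formulas are not preserved in this transcript. The sketch below assumes they are the usual standard-error formulas for a sample mean and a sample proportion; the function names and the example values (a standard deviation of 2.5 and a proportion of 0.5) are illustrative.

```python
from math import sqrt

def standard_error_mean(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def standard_error_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

# Quadrupling the sample size halves the standard error.
for n in (100, 400, 1600):
    print(n, standard_error_mean(2.5, n), round(standard_error_proportion(0.5, n), 4))
```

Because n sits under a square root, precision improves slowly: halving the uncertainty takes four times as many observations.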
We use data to build models of reality.
Confidence Intervals, Hypothesis Testing, p-value
• Null Hypothesis: What we are hoping to disprove.
• Alternative Hypothesis: What we hope to prove.
• P-value: The probability of observing results at least as extreme as
these, if the null hypothesis is true.
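To make the p-value definition concrete, here is a small sketch for a coin-toss experiment. It assumes the null hypothesis is "the coin is fair" and counts every outcome at least as far from the expected number of heads as the one observed; `two_sided_p_value` is an illustrative name.

```python
from math import comb

def two_sided_p_value(heads: int, tosses: int) -> float:
    """Exact two-sided p-value for a fair-coin null hypothesis.

    Sums the probability of every outcome at least as far from the
    expected count (tosses / 2) as the observed number of heads.
    """
    expected = tosses / 2
    observed_gap = abs(heads - expected)
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2**tosses

print(two_sided_p_value(9, 10))  # 22/1024
print(two_sided_p_value(6, 10))
```

Nine heads in ten tosses gives a p-value of about 0.021, so a fair coin would rarely produce a result that lopsided. Six heads gives about 0.75, which is entirely unremarkable.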
When we get it wrong
[Figure: table of decision errors. α marks a Type I error; β marks a Type II error.]
Another way to remember it
Significance is important, but significant
results might not be.
Significant <> Important
P-hacking: false significance
Goodhart’s Law: when a measure becomes a target, it is no
longer a good measure
Measuring Weirdness
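In one dimension, "weirdness" can be measured as a z-score: the number of standard deviations from the mean. A minimal sketch using the figures from the speaker notes (mean of about 69 inches, standard deviation of 2.5 inches) and assuming heights are normally distributed; the function names are illustrative.

```python
from math import erf, sqrt

MEAN_HEIGHT = 69.0  # inches, approximate mean from the notes
STD_DEV = 2.5       # inches, from the notes

def z_score(height: float) -> float:
    """Number of standard deviations a height lies above the mean."""
    return (height - MEAN_HEIGHT) / STD_DEV

def fraction_shorter(z: float) -> float:
    """Normal CDF: share of the population below a given z-score."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for label, height in (("tall man", 78.0), ("Shaq", 85.0)):
    z = z_score(height)
    print(f"{label}: z = {z:.1f}, taller than {fraction_shorter(z):.6%}")
```

A height of 78 inches gives z = 3.6 and 85 inches gives z = 6.4, matching the figures in the notes.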
Measuring Weirdness in two dimensions
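For two correlated variables, the Mahalanobis distance plays the role of the z-score. The sketch below is a plain-Python version for the two-dimensional case, using the closed-form inverse of a 2x2 covariance matrix. The function name and the five data points are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from math import sqrt

def mahalanobis_2d(point, points):
    """Mahalanobis distance from `point` to the center of 2-D `points`."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Quadratic form d^T S^-1 d, using the closed-form 2x2 inverse.
    dx, dy = point[0] - mean_x, point[1] - mean_y
    det = sxx * syy - sxy**2
    return sqrt((syy * dx**2 - 2 * sxy * dx * dy + sxx * dy**2) / det)

# Hypothetical, positively correlated data centered on (0, 0).
data = [(-2, -1), (-1, -1), (0, 0), (1, 1), (2, 1)]
with_trend = mahalanobis_2d((1, 1), data)      # lies along the trend
against_trend = mahalanobis_2d((-1, 1), data)  # same straight-line distance, off the trend
print(with_trend, against_trend)
```

Both test points are the same straight-line distance from the center, but the one that goes against the correlation comes out several times "weirder".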
Probability
Descriptive Statistics
Inference
Questions?
Contact me:
email: gerald.belton@gmail.com
website: http://www.geraldbelton.com
LinkedIn: https://www.linkedin.com/in/beltongerald/

Editor's Notes

  1. Before we can talk about statistics, we need to talk about probability. Specifically, we need to talk about the two kinds of probability distributions: discrete and continuous.
  2. Rolling dice always gives an integer value. A coin toss is either heads or tails. A chart of all the possible outcomes illustrates a discrete distribution.
  3. Every time you toss a coin, it comes up either heads or tails – unless a seagull snatches it out of the air, or it sticks sideways in the ground. If it’s a fair coin, we expect to get heads 50% of the time and tails 50% of the time. Whenever a single trial has exactly two possible outcomes, we have a Bernoulli distribution.
  4. If we toss a coin 10 times and count the number of times it comes up heads, and then do that same thing over and over again, this is how the results should look. We have a very small (but not zero!) chance of 10 heads or 10 tails; five of each is the most likely outcome, and we should see that about 25% of the time.
  5. We see the same thing with dice. If we roll one die, there are six possible outcomes, and each outcome has a probability of 1/6. But if we roll two dice, we have 11 possible totals (2 through 12); 7 is the most likely total, while 2 and 12 are the least likely.
  6. If we are measuring blood sugar, or temperature, or many other things, the result is a continuous distribution. The number of possible outcomes is infinite (even though it may be bounded).
  7. This graph shows the number of days in 2017 with various amounts of rainfall at RDU airport. Days with no rain are not included here. We see that most rainy days have less than a quarter-inch of rainfall, and as the amount of rain increases, the number of days decreases.
  8. This curve has some interesting properties… for one thing, its length is infinite but the area under the curve is exactly 1.
  9. This graph represents the height, in inches, of 205 men measured in England in the 1880s. The data was collected by Francis Galton, a cousin of Charles Darwin. Galton studied the heights of 205 men, their wives, and their adult children. Galton, F. (1886). Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute, 15, 246-263.
  10. Here, I’ve colored sections of the curve to show the inflection points. The red part of the curve is concave-down; the blue parts are concave-up. The boundary between red and blue is the inflection point, where the curve changes concavity.
  11. I’ve added a line at the center of the graph, labeled “mu” because statisticians love Greek letters. Mu refers to the population mean, or average, and it coincides with the peak of the normal curve. I’ve also added vertical lines at the inflection points. The distance from mu to these lines is called the standard deviation, or sigma (because statisticians love Greek letters). In any normal distribution, 68% of the population will fall within 1 standard deviation of the mean.
  12. I’ve added lines here for 2 and 3 standard deviations from the mean. 95% of the population falls within 2 (actually, 1.96) standard deviations, and 99.7% within 3 standard deviations.
  13. How can the sample mean have a distribution? Isn’t it just one number? If we take REPEATED samples, we will get a different mean from each different sample. So the mean from our random sample is one observation chosen at random from this distribution.
  14. Enough about probability. Let’s talk about statistics. First, we’ll talk about descriptive statistics. By “descriptive statistics,” we mean numbers that tell us something about a population or a sample.
  15. I really like data visualizations: they can provide a lot of information in a compact, easily digested form.
  16. For example, here’s some information about North Carolina from the census bureau. (point out features of this infographic)
  17. Does anyone know who this is? Are there any nurses here?
  18. Yes, that’s Florence Nightingale. Most people know her as the founder of modern nursing, but she was also a pioneer in the use of statistics.
  19. This is Nightingale’s famous “Diagram of the causes of mortality in the army of the East,” which she made during the Crimean War. The pink areas represent the number of soldiers who died from battle wounds, the blue areas represent the number of soldiers who died from poor sanitation or infections, and the black areas represent deaths from all other causes. After she insisted that nurses and doctors wash their hands, and after the sanitation commission flushed out the sewers and improved ventilation in the hospitals, the death rate dropped dramatically. This chart was instrumental in bringing about those reforms.
  20. This is a map drawn by Charles Minard in 1869. Noted information designer Edward Tufte says it “may well be the best statistical graphic ever drawn.” (point out features of the map)
  21. Not everything can, or should be, made into a chart. Sometimes you just need to see the numbers. There are certain numbers that are essential to understanding the distribution of a data set.
  22. Measures of location, or central tendency: mean (arithmetic average), median (half the data are below, half above), and mode (the most commonly occurring value). Skewed data results from outliers. Consider an extreme example, a village with 100 workers and one factory owner. The workers are each paid $10/year, and the factory owner makes $1,000,000 per year. The mean wage is $9,911 per year, but the median and the mode are both $10.
  23. Measures of spread: variance and standard deviation. In the variance formula we measure the distance from each data point to the mean… some will be negative, some positive, so we square them to keep them from cancelling out. Then we add together all of those squared differences, and divide by the number of data points. This average squared distance is called the variance. If we take the square root of the variance, we get the standard deviation.
  24. Fences: 1.5 times the interquartile range beyond the first and third quartiles. Points beyond the fences are marked as outliers.
  25. Population examples: Everyone in North Carolina; adults over 50 with high blood pressure; all of the Medicaid claims filed by a specific provider between 1 Jul 2017 and 31 Dec 2017. Parameters: Numbers that describe the population: median age of people in North Carolina; average systolic BP of adults over 50; amount Medicaid overpaid the provider. Sample: We can’t measure the entire population, so we draw a random sample. Statistics: Numbers that describe the sample (rather than the population). We use statistics measured on a random sample to infer the parameters of the population.
  26. There are a lot of different sampling methods, but it is important that they be random in order to avoid biasing the results.
  27. Do we believe the results of this survey? Why? Website surveys like this are not representative of the population, because the respondents are not chosen at random.
  28. In a simple random sample, every member of the population has the same probability of being selected. In a stratified random sample, every member of a subgroup (stratum) in the population has the same probability of being selected as every other member of the same subgroup.
  29. These are formulas for the standard deviation of two different types of data… what they have in common is “n”, the number of observations in the sample. The bigger this number gets, the smaller the spread of the estimate.
  30. George Edward Pelham Box was a British statistician, who has been called “one of the great statistical minds of the 20th century.” All models are wrong: there are no perfect spheres in nature. Some are useful: We can divide the earth’s surface as if it were a sphere, and the results are good enough to locate objects with our GPS systems.
  31. As we said before, we measure a variable across our sample, calculate a statistic, and use that statistic to estimate the parameter for the population. The result is never exact but the good news is that we can describe just how inexact it is!
  32. We can describe the inherent uncertainty in our data using confidence intervals; we can use our results to test a hypothesis and report the result using a p-value. (Definitions are on the slide; explain the graphs briefly.)
  33. Type I error: alpha (because we love Greek letters!) is the probability of making a Type I error, a false positive. Type II error: beta (because we love Greek letters!) is the probability of making a Type II error, a false negative. Notice that the null hypothesis is never proven; we either reject it or we fail to reject it. Just like in a courtroom trial, where the defendant is never found innocent, only “not guilty.” Courtroom: Null Hypothesis = Defendant is innocent. Prosecutor has to prove guilt in order to reject the null hypothesis.
  34. Ha Ha!
  35. Imagine a clinical experiment where we can conclusively prove that a new drug will lower blood pressure. But it only lowers it by an average of 1 point, say from 140 to 139. The result is statistically significant but nobody cares, because it is not clinically important.
  36. Clinical trials are set up to look for a “clinically important difference.” Sample size is chosen so that if that difference exists, there will be a specific probability of detecting it. This is called the “power” of the trial. Power is 1 minus beta. Confidence is 1 minus alpha.
  37. P: the probability that, if the null hypothesis is true, we would observe results at least as extreme as the ones we have observed. It is common to use 0.05 as the cutoff for statistical significance, but this is arbitrary. Also, if there is any true difference at all, however small, a large enough sample will produce a result with p < 0.05 or any other arbitrary level.
  38. Here again is the Galton data on the heights of adult males in England in the 1880s. It follows a normal distribution (more or less), and we note that there are a couple of men in his sample who are unusually short, and one who is unusually tall. The tall guy here is 6 feet 6 inches, by the way. The mean height is about 69 inches and the standard deviation is 2 and a half inches. Our tall guy is about 3.6 standard deviations taller than the average. Based on our normal distribution we can calculate that he would be taller than 99.98% of the population.
  39. Just for fun, I’ve added one data point to the graph: Shaq! At 85” tall, Shaq is 6.4 standard deviations above the average. He’s taller than 99.999999% of the population!
  40. Here we have data points plotted across two correlated variables. The circled point is not an extreme outlier in either dimension, but it’s far away from the mass of points in the ellipse. We can generalize this to any number of dimensions, but it’s hard to visualize. But we can express it mathematically: the distance between the chosen point and the center of the data, scaled by the spread and correlation of the data, is called the Mahalanobis distance.