This document provides an overview of statistics and the role of statisticians. It begins by introducing the author, Corey Chivers, who is a PhD student studying biological invasions using statistics. It then defines a statistician as someone who turns data into insights, answers questions about the world, and isn't necessarily fun to talk to at parties. The document discusses how statisticians assume the world is boring under the null hypothesis and look for evidence against it. It provides examples of descriptive statistics like variance, standard deviation, and the mean. It also introduces hypothesis testing and the Student's t-test for comparing two groups and determining if any differences could be due to chance.
5. What is a Statistician?
A statistician is someone who:
● Turns data into insights.
● Answers questions about the world.
● Isn't fun to talk to at a party?
16. Portrait of a Statistician
The cool kids are calling themselves Data Scientists
Name: Hilary Mason
Title: Chief Data Scientist at bit.ly
member of Mayor Bloomberg’s Technology
and Innovation Advisory Council
From her web bio:
“I <3 data and cheeseburgers.”
18. What do you know about statistics?
● On a piece of paper, make a list of all the
words you know about statistics.
● I'll start:
– Average (mean)
– Variance
– Normal distribution
– ...
19. Despite how exciting we are,
statisticians always start by
assuming the world is boring
The Null Hypothesis, or H0, is this boring world.
Usually something like “there is no effect of
caption size on the lulzyness of LOLcats”
21. Looking for evidence against the
Null Hypothesis
● The alternative hypothesis (Ha) is that
something interesting is going on.
– Ex: “Bigger captions are, on average, funnier”
● How would we know?
● To the internetz!
23. Collect some sample data!
Small caption, fairly humorous
Big caption, quite funny
Small caption, funny-ish
Big caption, peed in pants a little
24. Dealing with variability
Some small caption images
are funny, and some large
caption images are not funny.
There is variance in the data.
But we want to know if there
is a difference on average.
We'll need to take variance
into account.
25. Descriptive Statistics
Measures of Variability
Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Standard Deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Where $x_i$ = the ith value of a distribution
$n$ = number of values in the sample
$\bar{x}$ = sample mean
27. Your turn
Calculate the variance of the heights in your group.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
1) Write down your heights ($x_i$)
2) Calculate the average ($\sum x_i / n$)
3) Subtract the average from each height and square it
4) Add them all up and divide by n-1
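The four steps above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `sample_variance` and the three heights are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import math


def sample_variance(values):
    """Sample variance: squared deviations from the mean, summed, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to divide by n - 1")
    mean = sum(values) / n                                   # step 2: the average
    squared_deviations = [(x - mean) ** 2 for x in values]   # step 3: subtract and square
    return sum(squared_deviations) / (n - 1)                 # step 4: add up, divide by n - 1


# Step 1: hypothetical heights in cm (illustrative values, not real class data)
heights = [160.0, 170.0, 180.0]
variance = sample_variance(heights)
std_dev = math.sqrt(variance)  # the standard deviation is the square root of the variance
print(variance, std_dev)
```

Here the average is 170, the squared deviations are 100, 0 and 100, and dividing their sum by n-1 = 2 gives a variance of 100 and a standard deviation of 10.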
29. Measures of Central Tendency
Calculating the Mean
Using the following distribution of values:
1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9
(Arithmetic) Mean – the average of a distribution of values
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\text{Sum of values in dist'n}}{\text{Number of values in dist'n}}$$
$$\bar{x} = \frac{1+2+3+3+4+4+4+5+5+5+5+5+6+6+6+7+7+8+9}{19} = \frac{95}{19} = 5$$
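As a quick check of the calculation above, here is the same mean in Python, using the slide's own list of values. The variable names are just for illustration.

```python
# The distribution of values from the slide
values = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]

total = sum(values)   # sum of values in the distribution
n = len(values)       # number of values in the distribution
mean = total / n      # note: the mean divides by n, not by n - 1
print(total, n, mean)
```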
30. Could the difference be due to
chance?
Remember, we started by
assuming that there was no
difference (the Null
Hypothesis).
If the Null Hypothesis is
true, what are the
chances that we
observed this amount of
difference between
groups?
How do we decide whether
the difference is due to
chance or not?
By vote???
31. A better way: (formal) Hypothesis
testing
● Determine in advance the level of error you
are willing to put up with.
– We cannot avoid the chance of errors, but we can
decide how often we are willing to have them
happen.
● Biologists like to use 0.05 (a 1 in 20 chance).
● We call this α (alpha)
Ronald Fisher:
The man behind
the idea of NHST
33. A better way: (formal) Hypothesis
testing
● Calculate how likely it would be to see data at least as extreme
as yours if the null were true (the p-value).
● If it is less than α, we say that we reject the
null hypothesis.
● If we reject the null, we say the results are
statistically significant.
“The world is not boring after all!”
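One way to see what "how likely under the null" means is to compute it by brute force. The sketch below uses an exact permutation test, which is not the method this deck goes on to use (that is the t-test), so treat it as an illustration only. If the null is true, the group labels are arbitrary, so it reshuffles the labels in every possible way and counts how often the difference in means is at least as large as the one observed. The function name is hypothetical; the numbers are the room-temperature and cold heart rates (bpm) from the deck's t-test vs. chi-square comparison slide.

```python
from itertools import combinations


def exact_permutation_p_value(group1, group2):
    """Share of all relabellings whose |difference in means| is >= the observed one."""
    pooled = group1 + group2
    n1 = len(group1)
    observed = abs(sum(group1) / n1 - sum(group2) / len(group2))
    at_least_as_extreme = 0
    total = 0
    # Try every way of choosing which pooled values get the "group 1" label
    for chosen in combinations(range(len(pooled)), n1):
        chosen = set(chosen)
        g1 = [pooled[i] for i in range(len(pooled)) if i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(g1) / len(g1) - sum(g2) / len(g2))
        if diff >= observed - 1e-12:  # small tolerance for floating-point noise
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


room_temp = [178, 169, 192]  # bpm
cold_temp = [86, 89, 55]     # bpm
p_value = exact_permutation_p_value(room_temp, cold_temp)
print(p_value)
```

With three values per group there are only 20 possible relabellings, and only the observed split and its mirror image are this extreme, so the non-directional p-value cannot be smaller than 2/20 = 0.1 no matter how different the groups look. Small samples limit how much evidence against the null you can collect.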
35. Let's do it!
● To calculate how likely it is that our data is
from the null hypothesis (i.e. the difference is due to
chance), we need a statistic.
● But first, some Beer!
36. Student's t-test
● William Sealy Gosset figured out how to test if
a batch of beer was significantly different from
the standard.
While working for the
Guinness brewing company,
he was forbidden to publish
academic research, so he
published his method under
the pseudonym 'Student'.
37. Student's t-test
The t-value is calculated using the following equation:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Where $\bar{x}_1$ and $\bar{x}_2$ are the means of the experimental and control
groups;
$s_1^2$ and $s_2^2$ are the variances of the experimental and control groups;
$n_1$ and $n_2$ are the sample sizes for the experimental and control
groups.
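The equation above can be sketched as a small Python function that works from raw measurements. This is a minimal illustration using only the standard library: the function name `t_statistic` and the two sample lists are hypothetical, and `statistics.variance` is the sample variance (it divides by n-1).

```python
import math
import statistics


def t_statistic(group1, group2):
    """t-value from the slide: difference in means over sqrt(s1^2/n1 + s2^2/n2)."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)  # sample variances
    n1, n2 = len(group1), len(group2)
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical measurements, for illustration only
experimental = [12.0, 14.0, 16.0]  # mean 14, variance 4
control = [9.0, 10.0, 11.0]        # mean 10, variance 1
t = t_statistic(experimental, control)
print(round(t, 2))
```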
39. t-test
State your alpha level
α = 0.05
If there is really no difference between the means,
there is a 5% chance that the t-test will detect one
anyway (a false positive).
40. Calculating your t-value
                     Generic-brand (Group 2)   Name-brand (Group 1)
Mean # of chips      x̄2 = 11.2                 x̄1 = 15.3
Standard deviation   s2 = 4.3                  s1 = 2.4
n (sample size)      n2 = 3                    n1 = 3
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
According to the data above: calculated t = 1.4
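For anyone who wants to check the arithmetic, here are the table's values substituted into the t equation step by step. Intermediate values are rounded to three decimals, and the standard deviations are squared to get the variances.

```latex
t = \frac{15.3 - 11.2}{\sqrt{\frac{2.4^2}{3} + \frac{4.3^2}{3}}}
  = \frac{4.1}{\sqrt{1.920 + 6.163}}
  = \frac{4.1}{\sqrt{8.083}}
  \approx \frac{4.1}{2.843}
  \approx 1.44
```

This rounds to the t = 1.4 reported above.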
41. Alternate Hypothesis
You can only test ONE possible alternate hypothesis at
any one time. The one chosen depends on what you are
looking to find.
Alternative hypothesis: 2 types
2-tailed
Non-directional (general): not specifying a
direction.
“The two groups are not the same”
1-tailed
Directional (specific): specify direction
“Group A is greater than group B.”
42. Look up the Critical t-value
In order to find your critical t-value, you need 3 pieces of
information:
1. Whether the alternate hypothesis is 1- or 2-tailed
2. Alpha level (usually = 0.05)
3. Degrees of freedom (df = n-1)
Calculating degrees of freedom (df)
Degrees of Freedom = n-1
What if you have 2 different sample sizes (n1 and n2)…
which do you pick to calculate your degrees of freedom?
A: df = the smaller of (n1 − 1) and (n2 − 1)
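The lookup described above can be sketched as follows. The critical values are copied from a standard t-table for α = 0.05 and only cover df 1 to 5, so this is a stand-in for a real table rather than a general solution. The names `CRITICAL_T_05`, `degrees_of_freedom` and `critical_t` are hypothetical.

```python
# Critical t-values for alpha = 0.05 from a standard t-table (df 1 to 5 only)
CRITICAL_T_05 = {
    "two-tailed": {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571},
    "one-tailed": {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015},
}


def degrees_of_freedom(n1, n2):
    """The slide's rule: the smaller of (n1 - 1) and (n2 - 1)."""
    return min(n1 - 1, n2 - 1)


def critical_t(n1, n2, tails="two-tailed"):
    """Look up the critical t-value for the given sample sizes and tail choice."""
    df = degrees_of_freedom(n1, n2)
    return CRITICAL_T_05[tails][df]


# The chips example has n1 = n2 = 3, so df = 2
chips_critical = critical_t(3, 3)
print(chips_critical)
```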
44. Compare your ‘calculated’ t-value with your ‘critical’ t-value
It is the difference in values between the t-value and critical t that
will determine whether you can reject or fail to reject your null
hypothesis
a) If ‘calculated’ > ‘critical’, then: reject null hyp.
“My observed data are really unlikely under the
null hypothesis, therefore I reject the null
hypothesis!”
b) If ‘calculated’ < ‘critical’, then: do NOT reject null
hyp.
“My observed data are consistent with the null
hypothesis, therefore I have no reason to believe
that it is not true.”
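The decision rule can be written as a tiny function. This sketch assumes a 2-tailed test, so it compares the absolute value of the calculated t with the critical value; for a 1-tailed test the sign of t would also matter. The function name is hypothetical. The inputs are the chips example's calculated t (about 1.44) and the standard table value for α = 0.05, 2-tailed, df = 2.

```python
def decide(calculated_t, critical_t):
    """Apply rules (a) and (b) above for a 2-tailed test."""
    if abs(calculated_t) > critical_t:
        return "reject null"
    return "do NOT reject null"


# Chips example: calculated t of about 1.44 against a critical value of 4.303
decision = decide(1.44, 4.303)
print(decision)
```

For the chips data the calculated value is well below the critical value, so the null hypothesis is not rejected.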
46. What if we are measuring a
category, rather than a number?
● The t-test lets us compare the value of some
attribute between two groups.
– Do mutant fruit flies live longer than wild type?
– Does IQ differ between Dawson and Laurier students?
– Does drug x decrease blood pressure?
● The dependent variable is quantitative:
– Life span
– IQ
– Blood pressure
47. What if we are measuring a
category, rather than a number?
● Chi-squared test lets us test hypotheses
about categories.
– Are there more cars of a certain colour getting speeding
tickets?
– Is the ratio of dominant to recessive phenotypes 3:1?
– Do chromosomes assort independently?
● The dependent variable is categorical:
– Car colour
– Phenotype
– Chromosome donor
48. Chi-square or T-test???
How do you know which one you need?
T-Test
• the dependent variable is quantitative (e.g. height, weight, etc.)
• data can be organized as two lists of numbers
Example (dependent variable: heart rate):
Room temp (bpm)   Cold temp (bpm)
178               86
169               89
192               55

Chi-square Test
• the dependent variable is qualitative (aka nominal data) (e.g. gender, colour, etc.)
• data can be easily tabulated as counts
Example (dependent variable: gender):
Male     98
Female   102
49. Steps to performing a chi-square test
1. State your null hypothesis
2. State your alternate hypothesis
3. State your alpha level (usually α = 0.05)
4. Calculate your ‘calculated chi-square value’
5. Look up your ‘critical chi-square value’ (from chi-square
table)
6. Compare your ‘calculated’ and ‘critical’ values
a) If ‘calculated’ > ‘critical’, conclusion: reject null hyp.
b) If ‘calculated’ < ‘critical’, conclusion: do NOT reject null
hyp.
7. State your conclusion
50. Sample hypotheses for chi-square
Sex ratio in our class
Null hypothesis
1. There is no difference between the frequency of
men and women in the class
Alternative hypothesis
2. There is a difference between the frequency of men
and women in the class
Chi-square can only test non-directional alt. hyp.
51. Calculating Chi-square
‘Calculated’ chi-square values are calculated
using the following formula:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
O = observed
E = expected
Calculating the chi-square is easier using the following table:
Gender   O   E   O−E   (O−E)²   (O−E)²/E
Female
Male
χ² = sum of last column =
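Filling in the table can be sketched in Python. Since the class counts are left blank here, this illustration borrows the 98 male / 102 female counts from the deck's earlier chi-square example, and it assumes the null hypothesis expects equal numbers of each. The function name `chi_square` is hypothetical.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories (the last column of the table)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Counts borrowed from the deck's earlier example: 102 female, 98 male
observed = [102, 98]
total = sum(observed)
# Assumption: under the null hypothesis, both categories are equally frequent
expected = [total / 2, total / 2]

chi2 = chi_square(observed, expected)
print(round(chi2, 2))
```

Each row contributes (±2)² / 100 = 0.04, so the calculated chi-square is 0.08.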
52. Looking up the Critical χ2
To find the critical χ2 , you need the alpha
level and the df.
Df for a χ2 test = (# of categories) – 1
In our example, df = 2-1 = 1
53. Compare your ‘calculated’ chi-sq with your ‘critical’ chi-sq
It is the difference between the calculated chi-sq and critical chi-sq
that will determine whether you can reject or fail to reject your
null hypothesis
a) If ‘calculated’ > ‘critical’, then: reject null hyp.
“My observed data are really unlikely under the
null hypothesis, therefore I reject the null
hypothesis!”
b) If ‘calculated’ < ‘critical’, then: do NOT reject null
hyp.
“My observed data are consistent with the null
hypothesis, therefore I have no reason to believe
that it is not true.”
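Looking up the critical value and making the comparison can be sketched together. The critical values below are copied from a standard chi-square table for α = 0.05 and only cover df 1 to 5. The names are hypothetical, and the calculated value of 0.08 assumes the 98 male / 102 female counts from the deck's earlier example with equal expected frequencies.

```python
# Critical chi-square values for alpha = 0.05 from a standard table (df 1 to 5 only)
CRITICAL_CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_decision(calculated, n_categories):
    """df = (# of categories) - 1, then apply rules (a) and (b) above."""
    df = n_categories - 1
    critical = CRITICAL_CHI2_05[df]
    if calculated > critical:
        return "reject null"
    return "do NOT reject null"


# Two categories (male, female), so df = 1 and the critical value is 3.841
result = chi_square_decision(0.08, 2)
print(result)
```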
55. Questions for Corey
● You can email me!
Corey.chivers@mail.mcgill.ca
● I blog about statistics:
bayesianbiologist.com
● I tweet about statistics:
@cjbayesian