2. You should be familiar with the following
• Mean (medelvärde), for a normal distribution
• Median (median)
• Mode (typvärde)
• Line chart (linjediagram)
• Bar chart (stapeldiagram)
Petri Lankoski, 2018 2
3. Is the Die Loaded?
1st throw: 1
The chance of getting a 1 is 1/6, but as a first throw this is as likely as any other result. We do not have enough information to say anything more about the die.
2nd throw: 1
3rd throw: 4
4th throw: 1
5th throw: 2
6th throw: 1
We cannot say for certain, but we can estimate how likely or unlikely the observed sequence is.
In the long run we expect to see equal numbers of 1s, 2s, 3s, 4s, 5s and 6s.
Six throws are probably still too few to judge the die, so we would need to roll more…
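The long-run expectation can be illustrated with a small simulation. This is only a sketch: it uses Python's pseudo-random generator as a stand-in for a fair die, and the function name `roll_counts` and the seed are arbitrary choices made for this example.

```python
import random
from collections import Counter

def roll_counts(n_rolls, seed=None):
    """Roll a simulated fair six-sided die n_rolls times and count each face."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    counts = Counter(rolls)
    # Return counts for faces 1..6, including faces that never came up.
    return [counts.get(face, 0) for face in range(1, 7)]

# With few rolls the shares are uneven; with many rolls they approach 1/6 each.
for n in (6, 60, 60000):
    counts = roll_counts(n, seed=2018)
    shares = [round(c / n, 3) for c in counts]
    print(n, counts, shares)
```

With only six rolls the counts are typically lopsided even for a fair die, which is why six throws say little about whether the die is loaded.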
4. Is the Die Loaded?
We roll the following sequence: 1 1 4 1 2 1 3 6 1 1 1 5
Testing this sequence against the expected distribution indicates that the die is loaded
• But a fair die would produce a sequence this uneven about 1% of the time, so there is a small chance that we are wrong
We roll the following sequence: 2 6 2 6 6 4 6 5 4 1 3 4 4 6 5 3 5 3 2 5
• The numbers of 6s and 1s do not match the expected numbers
• But a fair die would deviate at least this much about 70% of the time, so we would very likely be wrong if we claimed that the die is loaded
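The comparison against the expected counts can be sketched as a chi-squared goodness-of-fit statistic, which is the test run in R in the editor's notes. The version below is a minimal standard-library sketch: the function name is made up for this example, and instead of a p-value it compares the statistic with the standard table value for 5 degrees of freedom at the 5% level.

```python
from collections import Counter

def chi_square_fair_die(rolls, faces=6):
    """Chi-squared statistic of observed face counts against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / faces  # a fair die gives every face the same expected count
    return sum((counts.get(face, 0) - expected) ** 2 / expected
               for face in range(1, faces + 1))

# Critical value of the chi-squared distribution, df = 5, 5% level (standard table value).
CRITICAL_5_PERCENT = 11.07

first = [1, 1, 4, 1, 2, 1, 3, 6, 1, 1, 1, 5]
second = [2, 6, 2, 6, 6, 4, 6, 5, 4, 1, 3, 4, 4, 6, 5, 3, 5, 3, 2, 5]

for name, rolls in (("first", first), ("second", second)):
    stat = chi_square_fair_die(rolls)
    print(name, round(stat, 2), "exceeds 5% critical value:", stat > CRITICAL_5_PERCENT)
```

The first sequence gives a statistic of 15, above the critical value, while the second gives 2.8, well below it.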
6. Density and violin plot
A violin plot is a form of density plot
Density plot and data points
7. Scatter plot
[Figure: scatter plot of Variable 1 against Variable 2]
A scatter plot shows the values of two variables
• For example, how each participant answered two questions
8. Random sampling
Predicting election results
- It is not practically possible to ask everyone how they will vote
- Instead, we pick a sample of people at random and ask them
However, we know that there is uncertainty here
If we draw a random sample again, we might get a different result
We get:
A: 37.6%
B: 12.3%
C: 33.1%
D: 5.2%
…
We get:
A: 36.9%
B: 13.0%
C: 32.7%
D: 6.1%
…
We can estimate uncertainty, but we need to make some
assumptions
We get:
A: 38.7%
B: 11.0%
C: 31.7%
D: 6.3%
…
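The way repeated random samples give slightly different results can be sketched with a simulation. Everything specific here is an assumption made for illustration: the "true" support values in `TRUE_SUPPORT`, the sample size of 1000, and the function name `run_poll` do not come from the slides.

```python
import random

# Hypothetical "true" support in the population; the slides do not state these values.
TRUE_SUPPORT = {"A": 0.37, "B": 0.12, "C": 0.33, "D": 0.06, "other": 0.12}

def run_poll(sample_size, rng):
    """Ask sample_size randomly chosen voters and return each option's share in percent."""
    options = list(TRUE_SUPPORT)
    weights = list(TRUE_SUPPORT.values())
    answers = rng.choices(options, weights=weights, k=sample_size)
    return {opt: round(100 * answers.count(opt) / sample_size, 1) for opt in options}

# Three polls drawn from the same population give three slightly different results.
rng = random.Random(2018)
for poll_number in range(1, 4):
    print(poll_number, run_poll(1000, rng))
```

Each run of `run_poll` draws a fresh sample, so the percentages move by a point or so between polls even though the population has not changed.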
10. Back to polling
[Figure: normal distribution of support for A; x-axis "Percentage of people voting A" marked at -1.96𝜎, 0𝜎 and 1.96𝜎; y-axis "Probability"; red vertical lines at the observed values 36.1%, 38.7% and 37.6%; black vertical line at the population value]
95% of the population lies within ±1.96𝜎; the sampling distribution behaves similarly.
We do not know the true population value (black vertical line).
We cannot know where in the population distribution our observations fall (red vertical lines).
However, with 95% certainty, what we observed falls in the area between -1.96𝜎 and 1.96𝜎.
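The link between 95% and 1.96𝜎 can be checked numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are arbitrary.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of a normal distribution lying between -1.96 and +1.96 standard deviations.
share_within = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# The reverse question: how many standard deviations leave 2.5% in each tail?
z_95 = standard_normal.inv_cdf(0.975)

print(round(share_within, 4))
print(round(z_95, 3))
```

The first value comes out very close to 0.95 and the second very close to 1.96, which is where the multiplier comes from.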
11. Random sampling
Instead of uncertainty, confidence is usually reported.
A confidence interval (CI), usually 95%, is a function of the sample size and the probability of someone choosing a candidate.
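\[ 0.376 \pm 1.96 \cdot \sqrt{\frac{0.376\,(1 - 0.376)}{N}} \]
Here 0.376 is the observed support for A, 1.96 is the multiplier for 95% confidence, the square-root term is 𝜎, and N is the sample size.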
We can backtrack from the sample distribution and estimate the uncertainty in what we observed when polling
• When we poll the next time, with 95% certainty what we observe falls in the area between -1.96𝜎 and 1.96𝜎
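As a worked example, the interval for the observed support of 37.6% can be computed for a few sample sizes. This is a sketch only: the slides leave N unspecified, so the values 100, 1000 and 10000 are assumptions, and `proportion_ci` is a name made up for this example.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for an observed proportion."""
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)  # square-root term of the CI formula
    margin = z * sigma
    return p_hat - margin, p_hat + margin

# The sample sizes are assumed for illustration; the slides do not give N.
for n in (100, 1000, 10000):
    low, high = proportion_ci(0.376, n)
    print(n, round(low, 3), round(high, 3))
```

With an assumed N of 1000 the interval is roughly 34.6% to 40.6%; larger samples narrow it, smaller samples widen it.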
12. Are two means different, t-test?
[Figure: sample means A and B and their difference ∆]
We have two sample means A and B
Their difference is ∆=B-A
The mean is calculated from the sampled values:
\[ \mathrm{Mean}(A) = \frac{\sum a}{n} \]
(for normally distributed variables)
To extrapolate whether there is a difference between groups A and B at the population level (from which A and B were sampled), we need to account for uncertainty. Again, the population mean and the sample mean can differ.
13. Are two means different, t-test?
[Figure: sample means A and B and their difference ∆]
We have two sample means A and B
Their difference is ∆=B-A
The t statistic describes the difference so that it takes into account the variance (𝜎²) and the sample size.
p describes the probability of observing data that deviates at least this much from the null hypothesis if the null hypothesis is true; the null hypothesis of the t-test is that the means are not different.
p depends on the t-value and the sample size; a t-value further from zero means a lower p.
p = 0.05 means that, if there were no difference, there would be a 5% chance of observing a difference at least this large. p < 0.05 is a typical criterion for a statistically significant result.
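How the t statistic combines the difference ∆ with variance and sample size can be sketched in a few lines. The slides do not say which variant of the t-test is meant; this example assumes Welch's version, which allows unequal variances, and the measurements are made up. Turning t into a p-value needs the t distribution, which the standard library lacks (SciPy's `scipy.stats.ttest_ind` is one option).

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """t statistic for the difference B - A, allowing unequal variances (Welch)."""
    delta = mean(sample_b) - mean(sample_a)
    # Standard error of the difference: sample variance and sample size of each group.
    standard_error = math.sqrt(variance(sample_a) / len(sample_a)
                               + variance(sample_b) / len(sample_b))
    return delta / standard_error

# Made-up measurements for two groups, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
group_b = [6.1, 6.4, 5.9, 6.8, 6.3, 6.6]

print(round(welch_t(group_a, group_b), 2))
```

A larger difference, a smaller variance, or a larger sample each pushes t further from zero.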
14. Are three means different, one-way ANOVA?
• One-way ANOVA is similar to the t-test
• The F statistic describes the difference so that it takes into account the variance and the sample size
• p describes the probability of observing data that deviates at least this much from the null hypothesis if the null hypothesis is true; the null hypothesis of ANOVA is that the means are not different
• A significant result (p<0.05) tells us that at least one mean differs from the others
• But not which
• Post hoc comparisons are needed to determine which mean differs from which
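The F statistic can be sketched as the ratio of between-group variance to within-group variance. The function name and the three groups of scores below are made up for illustration, and the sketch stops at the statistic itself without computing a p-value or post hoc comparisons.

```python
from statistics import mean

def one_way_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up scores for three groups, for illustration only.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(one_way_f(groups))
```

For these groups the statistic is 3.0; if all three group means were equal it would be 0.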
15. Correlation
Correlation (r) describes the strength of association between two variables
• Negative correlation means negative association: when V1 increases, V2 decreases
p describes the probability of observing a correlation at least this strong under the null hypothesis (which is that there is no relation between the two variables)
Correlation does not tell whether V1 causes V2 or vice versa
• There is a correlation between ice cream sales and drowning
• Neither causes the other
• A third variable, temperature, can explain both
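The ice cream example can be sketched with Pearson's r computed by hand. The weekly figures are invented purely for illustration, and the sketch assumes that r here means the Pearson coefficient, which the slides do not state explicitly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up weekly figures, for illustration only: both series follow temperature.
temperature = [12, 15, 18, 21, 24, 27, 30]
ice_cream_sales = [110, 135, 170, 200, 240, 265, 310]
drownings = [1, 1, 2, 3, 3, 4, 5]

print(round(pearson_r(ice_cream_sales, drownings), 2))
print(round(pearson_r(temperature, ice_cream_sales), 2))
print(round(pearson_r(temperature, drownings), 2))
```

Sales and drownings correlate strongly in this toy data only because both rise with temperature, which is the third-variable explanation from the slide.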
Editor's Notes
https://stats.stackexchange.com/questions/3194/how-can-i-test-the-fairness-of-a-d20/3735#3735
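# Slide 4, first sequence: chi-squared goodness-of-fit test against a fair d6
chisq.test(table(c(1,1,4,1,2,1,3,6,1,1,1,5)), p = rep(1/6,6))
# Output:
#   Chi-squared test for given probabilities
#   data: table(c(1, 1, 4, 1, 2, 1, 3, 6, 1, 1, 1, 5))
#   X-squared = 15, df = 5, p-value = 0.01036
# Comparison: a simulated non-biased die
rolls <- sample(1:6, 20, replace = TRUE) # 20 throws of a d6
# factor() keeps faces that were never rolled, so the table always has 6 cells
chisq.test(table(factor(rolls, levels = 1:6)), p = rep(1/6, 6))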
Polling is done via random sampling from the telephone directory. However, people who own a phone and people who vote are not the same population, so the poll results are systematically off. There are techniques to counter this sampling bias, especially for voting, where the actual results can be compared with the poll results.
𝜎 = standard deviation; it describes the width of the distribution
Black vertical line: population value
Red vertical line: sample values
The standard deviation is the square root of the variance.