Representing and generating uncertainty effectively - Azdeen Najah
Prof. Frank H. Knight (1921) proposed that "risk" is randomness with knowable probabilities, and "uncertainty" is randomness with unknowable probabilities. However, risk and uncertainty both share features with randomness. The illustration here explains the relationship between the concepts better than words...
In this talk I discuss our recent Bayesian reanalysis of the Reproducibility Project: Psychology.
The slides at the end include the technical details underlying the Bayesian model averaging method we employ.
The history of p-values is covered to try to shed light on a mystery: why did Student and Fisher agree numerically but disagree in terms of interpretation?
The Seven Habits of Highly Effective Statisticians - Stephen Senn
If you know why the title of this talk is extremely stupid, then you clearly know something about control, data and reasoning: in short, you have most of what it takes to be a statistician. If you have studied statistics then you will also know that a large amount of anything, and this includes successful careers, is luck.
In this talk I shall try to share some of my experiences of being a statistician in the hope that it will help you make the most of whatever luck life throws you. In so doing, I shall try my best to overcome the distorting influence of that easiest of sciences, hindsight. Without giving too much away, I shall be recommending that you read, listen, think, calculate, understand, communicate, and do. I shall give you some examples of what I think works and what I think doesn't.
In all of this you should never forget the power of negativity and also the joy of being able to wake up every day and say to yourself 'I love the smell of data in the morning'.
Statistics for UX Professionals - Jessica Cameron, User Vision
Are you looking to expand your research toolkit to include some quantitative methods, such as survey research or A/B testing? Have you been asked to collect some usability metrics, but aren’t sure how best to go about that? Or do you just want to be more aware of all of the UX research possibilities? If your answer to any of those questions is yes, then this session is for you.
You may know that without statistics, you won’t know if A is really better than B, if users are truly more satisfied with your new site than with your old one, or which changes to your site have actually impacted conversion rates. However, statistics can also help you figure out how to report satisfaction and other metrics you collect during usability tests. And they’re essential for making sense of the results of quantitative usability tests.
This session will focus on the statistical concepts that are most useful for UX researchers. It won’t make you a quant, but it will give you a good grounding in quantitative methods and reporting. (For example, you will learn what a margin of error is, how to report quantitative data collected during a usability test - and how not to - and how many people you really need to fill out a survey.)
Systems Thinking offers methods for breaking down any problem into its component parts so that you can truly understand its causes and start to formulate solutions. Learn how four concepts—Distinctions, Systems, Relationships, and Perspectives—can help you tackle any professional challenge.
Presidents' invited lecture ISCB Vigo 2017
Discusses various issues to do with how randomised clinical trials should be analysed. See also https://errorstatistics.com/2017/07/01/s-senn-fishing-for-fakes-with-fisher-guest-post/
These slides were presented on November 22, 2016, during the Annual Julius Symposium, organised by the Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht.
Only a few months ago, the American Statistical Association authoritatively issued an official statement on significance and p-values (The American Statistician, 2016, 70:2, 129-133), claiming that the p-value is "commonly misused and misinterpreted."
In this presentation I focus on the principles of the ASA statement.
This year marks the 70th anniversary of the Medical Research Council randomised clinical trial (RCT) of streptomycin in tuberculosis led by Bradford Hill. This is widely regarded as a landmark in clinical research. Despite its widespread use in drug regulation and in clinical research more widely, and its high standing with the evidence based medicine movement, the RCT continues to attract criticism. I show that many of these criticisms are traceable to a failure to understand two key concepts in statistics: probabilistic inference and design efficiency. To these methodological misunderstandings can be added the practical one of failing to appreciate that entry into clinical trials is not simultaneous but sequential.
I conclude that although randomisation should not be used as an excuse for ignoring prognostic variables, it is valuable and that many standard criticisms of RCTs are invalid.
Topic Learning Team - Number of Pages 2 (Double Spaced).docx - AASTHA76
Topic: Learning Team
Number of Pages: 2 (Double Spaced)
Number of sources: 1
Writing Style: APA
Type of document: Essay
Academic Level: Master
Category: Psychology
VIP Support: N/A
Language Style: English (U.S.)
Order Instructions:
I will attach the instruction. On this paper please follow the instructions carefully. Thank you
University of Phoenix Material
Correlation
PSYCH/610 Version 2
A researcher is interested in investigating the relationship between viewing time (in seconds) and ratings of aesthetic appreciation. Participants are asked to view a painting for as long as they like. Time (in seconds) is measured. After the viewing time, the researcher asks the participants to provide a ‘preference rating’ for the painting on a scale ranging from 1-10. Create a scatter plot depicting the following data:
Viewing Time in Seconds    Preference Rating
10                         3
12                         4
24                         7
 5                         3
16                         6
 3                         4
11                         4
 5                         2
21                         8
23                         9
 9                         5
 3                         3
17                         5
14                         6
What does the scatter plot suggest about the relationship between viewing time and aesthetic preference? Is it accurate to state that longer viewing times are the result of greater preference for paintings? Explain. Submit your scatter plot and your answers to the questions to your instructor.
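The assignment above asks for a scatter plot, but the same fourteen pairs can also be summarised numerically. As a sketch (the variable names and the pure-Python approach are my own, not part of the assignment), the Pearson correlation for the data can be computed as:

```python
import math

# Viewing times (seconds) and preference ratings from the table above.
times = [10, 12, 24, 5, 16, 3, 11, 5, 21, 23, 9, 3, 17, 14]
ratings = [3, 4, 7, 3, 6, 4, 4, 2, 8, 9, 5, 3, 5, 6]

n = len(times)
sx, sy = sum(times), sum(ratings)
sxx = sum(x * x for x in times)
syy = sum(y * y for y in ratings)
sxy = sum(x * y for x, y in zip(times, ratings))

# Pearson r via the computational formula.
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 2))  # 0.88 -- a strong positive association
```

A strong positive r is consistent with longer viewing times going together with higher ratings, but, as the question hints, it does not by itself show that greater preference causes longer viewing.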
Hypothesis Testing - NarcisaBrandenburg70
In doing research, one of the most common activities is testing hypotheses. The Afrobarometer data set below is a survey of African citizens’ attitudes on democracy, governance, the economy, and other related topics (www.afrobarometer.org). Using this data set, you might want to examine hypotheses related to whether rural and urban citizens differ, on average, in how much they trust the government. The tables below present results from an independent samples t-test to examine these hypotheses using a random sample of 44 participants from the complete data set. Each respondent’s score is a value between 0 and 15 with a higher score indicating greater trust. You can see that the mean for the urban group is 7.00 ( SD = 4.17) and the mean for the rural group is 7.74 ( SD = 4.38). The observed value of the t-statistic is -.564 and the p-value equals 0.576 (see the column labeled “Sig. (2-tailed)”).
African Citizens' Attitudes on Democracy
Independent Samples t-Test
                                          t      df   Sig. (2-tailed)   Mean Difference   Std. Error Difference
Trust in Government Index
(higher scores = more trust)           -.564     41        .576             -.73913             1.30978

Group Statistics
Urban or Rural Primary Sampling Unit      N      Mean      Std. Deviation   Std. Error Mean
Urban                                    20      7.000        4.16754           .93189
Rural                                    30      7.7391       4.38196           .91370
The p-value is the probability of obtaining a value more extreme than .564 (less than -.564 or greater than +.564) if you were to repeat the test with a new sample of data and if the null hypothesis is true. You will see in this Skill Builder that the p-value can easily be used to make statistical decisions in hypothesis testing. However, while the p-value is important in determining statistical significance, it does not tell the whole story.
Steps of Hypothesis Testing
To interpret p-values, let's review the key steps in hypothesis testing.
Step 1
State the null and alternative hypotheses
Recall that hypotheses are statements about population parameters. For the Trust in Government example from the Afrobarometer data set, the null hypothesis is H0: µurban = µrural and the alternative hypothesis is HA: µurban ≠ µrural.
The Greek letter, µ, indicates a population mean, and the subscripts indicate levels of the independent variable ("urban" and "rural"). Here the null is saying that the mean for the urban population on the Trust in Government variable is the same as the mean for the rural population; the alternative is saying that the two population means differ.
Learning Objectives - karlhennesey
Page 266
LEARNING OBJECTIVES
· Explain how researchers use inferential statistics to evaluate sample data.
· Distinguish between the null hypothesis and the research hypothesis.
· Discuss probability in statistical inference, including the meaning of statistical significance.
· Describe the t test and explain the difference between one-tailed and two-tailed tests.
· Describe the F test, including systematic variance and error variance.
· Describe what a confidence interval tells you about your data.
· Distinguish between Type I and Type II errors.
· Discuss the factors that influence the probability of a Type II error.
· Discuss the reasons a researcher may obtain nonsignificant results.
· Define power of a statistical test.
· Describe the criteria for selecting an appropriate statistical test.
Page 267IN THE PREVIOUS CHAPTER, WE EXAMINED WAYS OF DESCRIBING THE RESULTS OF A STUDY USING DESCRIPTIVE STATISTICS AND A VARIETY OF GRAPHING TECHNIQUES. In addition to descriptive statistics, researchers use inferential statistics to draw more general conclusions about their data. In short, inferential statistics allow researchers to (a) assess just how confident they are that their results reflect what is true in the larger population and (b) assess the likelihood that their findings would still occur if their study was repeated over and over. In this chapter, we examine methods for doing so.
SAMPLES AND POPULATIONS
Inferential statistics are necessary because the results of a given study are based only on data obtained from a single sample of research participants. Researchers rarely, if ever, study entire populations; their findings are based on sample data. In addition to describing the sample data, we want to make statements about populations. Would the results hold up if the experiment were conducted repeatedly, each time with a new sample?
In the hypothetical experiment described in Chapter 12 (see Table 12.1), mean aggression scores were obtained in model and no-model conditions. These means are different: Children who observe an aggressive model subsequently behave more aggressively than children who do not see the model. Inferential statistics are used to determine whether the results match what would happen if we were to conduct the experiment again and again with multiple samples. In essence, we are asking whether we can infer that the difference in the sample means shown in Table 12.1 reflects a true difference in the population means.
Recall our discussion of this issue in Chapter 7 on the topic of survey data. A sample of people in your state might tell you that 57% prefer the Democratic candidate for an office and that 43% favor the Republican candidate. The report then says that these results are accurate to within 3 percentage points, with a 95% confidence level. This means that the researchers are very (95%) confident that, if they were able to study the entire population rather than a sample, the actual percentage who preferred the Democratic candidate would be within 3 percentage points of 57%.
A lecture in the HEVGA research summer school at the University of Skövde, Sweden (Aug 21-23, 2019) on Game Analysis. The focus of the lecture is on formal analysis and some applications of it.
My level design intro course lecture: assignment description
Video at slide 16: https://youtu.be/w_x5wI3PNZA
Video at slide 25: https://youtu.be/OPIwVcOe3k0
Overview of a book about research methods edited by Petri Lankoski and Staffan Björk (2015). http://press.etc.cmu.edu/content/game-research-methods-overview
These slides introduce the Escape package, developed for teaching Unity. The package contains building blocks for a simple first-person sneaking game.
Contents:
- Introduction to prefabs in package
- Level design assignment
Download the Unity package:
http://www.mediafire.com/download/2t49ajxl6n7xq3z/escape_new.unitypackage
Slides revised Mar 23, 2015.
The course intro for the level design course, with an introduction to some surrealist methods and a development project aiming to use those techniques. This is part of an experiment in design teaching to extend students' design understanding outside of traditional methods.
The slides contain the course intro, instructions for a development assignment, and a description of the prefabs that are offered for the project (the Unity project will be available later, after I have fixed all the details and removed assets that I cannot redistribute).
A lecture on game system design. Introduction to concepts for describing and discussing designs with examples. Some notes about evaluating game system behavior.
- Course description and assignments (slides 3-10)
- Game system design
-- Game elements (slides 12-29)
-- Hints for design (slides 30-32)
- References + reading list (slides 33-34)
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
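As a sketch of the first of these ideas (skipping computation on vertices that have already converged), the following pure-Python power iteration freezes a vertex once its rank stops changing. The function name, thresholds, and graph representation are illustrative, and this is only the single optimisation, not the full STICD algorithm; the graph is assumed to have no dangling nodes.

```python
def pagerank(graph, damping=0.85, tol=1e-12, freeze=1e-8, max_iter=100):
    """graph: dict of vertex -> list of out-neighbours (no dangling nodes).
    Vertices whose rank change falls below `freeze` are marked converged
    and skipped in later iterations."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    # Precompute in-neighbours so each vertex can pull contributions.
    in_nbrs = {u: [] for u in nodes}
    for u, outs in graph.items():
        for v in outs:
            in_nbrs[v].append(u)
    frozen = set()
    for _ in range(max_iter):
        new_rank = dict(rank)
        delta = 0.0
        for v in nodes:
            if v in frozen:
                continue  # skip vertices that already converged
            r = (1 - damping) / n + damping * sum(
                rank[u] / len(graph[u]) for u in in_nbrs[v])
            delta += abs(r - rank[v])
            if abs(r - rank[v]) < freeze:
                frozen.add(v)
            new_rank[v] = r
        rank = new_rank
        if delta < tol:
            break
    return rank

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
print({k: round(v, 3) for k, v in ranks.items()})  # {'a': 0.333, 'b': 0.333, 'c': 0.333}
```

Note that freezing is approximate: a frozen vertex is no longer updated even if its in-neighbours later move, which is why the text says the technique has the potential to save time rather than guaranteeing identical results.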
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Unleashing the Power of Data: Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
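As a hedged illustration of the kind of filtering and aggregation the guide covers (the table, column names, and numbers here are invented for the example), SQL can be tried directly in Python's built-in sqlite3 module:

```python
import sqlite3

# An in-memory SQLite table of hypothetical page visits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, duration_sec REAL)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    ("home", 12.0), ("home", 20.0), ("pricing", 45.0),
    ("pricing", 30.0), ("pricing", 60.0), ("blog", 5.0),
])
# Aggregation with a filter on the aggregate: average duration per page,
# keeping only pages with more than one visit, longest average first.
rows = conn.execute("""
    SELECT page, COUNT(*) AS n, AVG(duration_sec) AS avg_sec
    FROM visits
    GROUP BY page
    HAVING COUNT(*) > 1
    ORDER BY avg_sec DESC
""").fetchall()
print(rows)  # [('pricing', 3, 45.0), ('home', 2, 16.0)]
```

GROUP BY plus HAVING aggregates and then filters on the aggregate, which is the pattern behind many "average per category" analyses mentioned above.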
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting OpenMP PageRank: SHORT REPORT / NOTES - Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
2. You should be familiar with the following:
• Mean (medelvärde), for a normal distribution
• Median (median)
• Mode (typvärde)
• Line chart (linjediagram)
• Bar chart (stapeldiagram)
Petri Lankoski, 2018 2
3. Is the Die Loaded?
We roll a die six times: 1st throw 1, 2nd throw 1, 3rd throw 4, 4th throw 1, 5th throw 2, 6th throw 1.
The chance to get a 1 is 1/6, but on the first throw this result is as likely as any other. We do not have enough information to say anything more about it.
In the long run we expect to see equal amounts of 1s, 2s, 3s, 4s, 5s and 6s. We cannot say for certain whether the die is loaded, but we can estimate how likely or unlikely the observed sequence is.
Six throws is probably still too little to assess the die, so we would need to roll more...
4. Is the Die Loaded?
We roll the sequence 1 1 4 1 2 1 3 6 1 1 1 5. Testing this sequence against the expected distribution indicates that the die is loaded.
• But we have around a 1% chance of being wrong
We roll the following sequence: 2 6 2 6 6 4 6 5 4 1 3 4 4 6 5 3 5 3 2 5.
• The amounts of 6s and 1s do not match the expected amounts
• But we would have a 70% likelihood of being wrong if we claim that the die is loaded
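The editor's notes at the end of this deck run this kind of test in R with chisq.test. A stdlib-only Python sketch of the same chi-squared test for the second, 20-roll sequence (the helper function, written out for exactly 5 degrees of freedom, is my own) reproduces the roughly 70% figure:

```python
import math

def chi2_sf_df5(x):
    # Survival function of chi-square with 5 degrees of freedom, via the
    # upper incomplete gamma recurrence Gamma(a+1, y) = a*Gamma(a, y) + y^a*e^-y,
    # starting from Gamma(1/2, y) = sqrt(pi) * erfc(sqrt(y)), with y = x/2.
    y = x / 2.0
    g_half = math.sqrt(math.pi) * math.erfc(math.sqrt(y))
    g_3half = 0.5 * g_half + math.sqrt(y) * math.exp(-y)
    g_5half = 1.5 * g_3half + y ** 1.5 * math.exp(-y)
    return g_5half / math.gamma(2.5)

rolls = [2, 6, 2, 6, 6, 4, 6, 5, 4, 1, 3, 4, 4, 6, 5, 3, 5, 3, 2, 5]
observed = [rolls.count(face) for face in range(1, 7)]  # [1, 3, 3, 4, 4, 5]
expected = len(rolls) / 6.0                             # 20/6 per face
stat = sum((o - expected) ** 2 / expected for o in observed)
p = chi2_sf_df5(stat)
print(round(stat, 2), round(p, 2))  # 2.8 0.73
```

A p-value around 0.73 means a deviation at least this large would be quite common for a fair die, matching the slide's "70% likelihood of being wrong" if we claimed the die is loaded.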
6. Density and violin plot
A violin plot is a form of density plot.
[Figure: density plot with overlaid data points]
7. Scatter plot
A scatter plot shows the values of two variables, for example how a participant answered two questions.
[Figure: scatter plot of Variable 1 against Variable 2]
8. Random sampling
Predicting election results: it is not practically possible to ask everyone what they will vote, so we pick a sample of people randomly and ask them. One sample gives us: A: 37.6%, B: 12.3%, C: 33.1%, D: 5.2%, ...
However, we know that there is uncertainty here. If we take another random sample, we might get something else, such as A: 36.9%, B: 13.0%, C: 32.7%, D: 6.1%, ... or A: 38.7%, B: 11.0%, C: 31.7%, D: 6.3%, ...
We can estimate the uncertainty, but we need to make some assumptions.
9. Normal distribution
68.3% of the data falls within ±1σ of the mean, and 95.4% within ±2σ.
σ = standard deviation
• describes the width of the distribution
10. Back to polling
95% of the population is in the area of ±1.96σ; the sample distribution behaves similarly. Within 95% certainty, what we observed falls in the area between -1.96σ and 1.96σ.
We do not know the true population value (black vertical line), and we cannot know where in the population distribution what we observed was (red vertical lines).
[Figure: support for A, with observed samples at 36.1%, 38.7% and 37.6%]
11. Random sampling
Instead of uncertainty, confidence is usually used. The confidence interval (CI), usually 95%, is a function of the sample size and the probability of someone choosing a candidate. For candidate A:
CI(95%, A) = 0.376 ± 1.96 · √(0.376(1 - 0.376) / N)
We can backtrack from the sample distribution and estimate the uncertainty in what we observed when polling:
• When we poll next time, within 95% certainty what we observe falls in the area between -1.96σ and 1.96σ
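As a worked sketch of the slide's formula (the slide leaves the sample size N unspecified, so N = 1000 here is an assumption):

```python
import math

# 95% CI margin for a polled proportion: 1.96 * sqrt(p(1-p)/N).
p_hat, n = 0.376, 1000          # p-hat from the slide; N = 1000 is assumed
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(round(margin, 3))                                    # 0.03
print(round(p_hat - margin, 3), round(p_hat + margin, 3))  # 0.346 0.406
```

With a sample of 1000, support measured at 37.6% carries a margin of about ±3 percentage points, the familiar "accurate to within 3 points" style of survey reporting.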
12. Are two means different? (t-test)
We have two sample means, A and B; their difference is ∆ = B - A.
A mean is calculated from the sampled values: Mean(A) = Σa / n (for normally distributed variables).
To extrapolate whether there is a difference between groups A and B at the population level (from which A and B were sampled), we need to account for uncertainty. Again, the population mean and the sample mean can be different.
13. Are two means different? (t-test)
The t statistic describes the difference between the two sample means so that it takes into account the variance (σ²) and the sample size.
p describes the probability of data deviating this far from the null hypothesis; the null hypothesis of the t-test is that the means are not different. p depends on the t-value and the sample size; a high t-value means a lower p.
p = 0.05 means that there is a 5% chance of observing a difference this large when there is no difference in the population. p < 0.05 is a typical criterion for a statistically significant result.
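A minimal sketch of the statistic the slide describes, using the pooled (equal-variance) form on two made-up samples; the function and data are illustrative only, and the p-value would come from the t distribution with n_a + n_b - 2 degrees of freedom:

```python
import math

def t_statistic(a, b):
    # Two-sample Student t statistic with pooled variance.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

print(round(t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 3))  # -1.897
```

The statistic grows when the mean difference grows, and shrinks when the variances grow or the samples get smaller, which is exactly the trade-off the slide describes.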
14. Are three means different? (one-way ANOVA)
• One-way ANOVA is similar to the t-test
• The F-statistic describes the difference so that it takes into account variance and sample size
• p describes the probability of data deviating this far from the null hypothesis; the null hypothesis of ANOVA is that the means are not different
• A significant result (p < 0.05) tells us that at least one mean differs from the others
• But not which one
• Post hoc comparisons are needed to determine which group differs from which
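A minimal sketch of the F statistic on three made-up groups (illustrative only; the p-value would come from the F distribution):

```python
def f_statistic(groups):
    # One-way ANOVA F: between-group variance over within-group variance.
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msb = ssb / (k - 1)  # between-group ("systematic") variance
    msw = ssw / (n - k)  # within-group ("error") variance
    return msb / msw

print(f_statistic([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

An F near 1 means the group means vary no more than chance would predict; larger F values make at least one differing mean more plausible.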
15. Correlation
Correlation (r) describes the strength of association between two variables.
p describes the likelihood that the observed correlation deviates from what is expected under the null hypothesis (which is that there is no relation between the two variables).
Correlation does not tell whether v1 causes v2 or vice versa:
• There is a strong correlation between ice cream sales and drownings
• Neither is causing the other
• A third variable, temperature, is related to both
Editor's Notes
https://stats.stackexchange.com/questions/3194/how-can-i-test-the-fairness-of-a-d20/3735#3735
chisq.test(table(c(1,1,4,1,2,1,3,6,1,1,1,5)), p = rep(1/6,6))
Chi-squared test for given probabilities
data: table(c(1, 1, 4, 1, 2, 1, 3, 6, 1, 1, 1, 5))
X-squared = 15, df = 5, p-value = 0.01036
Note that we cannot test that the die is not biased. We can only test whether it behaves unexpectedly enough.
rolls = sample(1:6, 20, replace=TRUE) # 20 times d6
chisq.test(table(rolls), p = rep(1/6,6))
Polling is done via random sampling using a telephone catalog. However, people owning a phone and people voting are not the same populations, so poll results can be systematically off; however, there are techniques to counter the sampling bias, especially in the case of voting, when it is possible to compare election results to poll results.
𝜎=standard deviation, describes the width of distribution
Black vertical line: population value
Red vertical line: sample values
The standard deviation is the square root of the variance.