Applied Statistics Part 1: Introduction to Distributions, Measures of Variability, and Confidence Intervals

Applied Statistics
Part 1
By
M. H. Farjoo MD, PhD
Shahid Beheshti University of Medical Sciences
Instagram: @bio_animation

Applied Statistics
part 1
 Introduction
 Normal (Gaussian) Distribution
 Standard Deviation
 Standard Error of the Mean
 Confidence Interval of the Mean

Which one do you hate most?
Cockroaches or statistics?

Introduction
 There are three kinds of lies: Lies, Damn Lies, and Statistics!
(Mark Twain)
 I can prove anything by statistics - except the truth. (George
Canning)
 It is true that we can easily tell a lie with statistics, yet it is
even easier to lie without it!
 Statistics is for making decisions about a population as accurately
as possible.
 … and no one claims it is error-free.

Introduction (Cont'd)
 With statistics we try to extrapolate findings from a
“sample” to the “population”.
 Kinds of variables:
 Categorical, Nominal, qualitative
 Measurement, Numeric, Quantitative
 Continuous
 Discrete
 Ordinal (ordered values; differences between values are not necessarily the same)
 Interval (differences between values are the same)
 Ratio (similar to interval, but also has a clear 0.0)

Normal (Gaussian) Distribution
 The experimenters think it can be proved by
mathematics, and the mathematicians believe it has
been established by observation. (W. Lippmann)
 Named after Carl Friedrich Gauss, a 19th-century
German mathematician.
 It underlies the assumptions of many statistical tests.
 The distribution emerges when many independent
random factors act in an additive manner to create
variability.

Normal (Gaussian) Distribution (Cont'd)
μ (mu) is the mean of the population
σ (sigma) is the standard deviation of the population

10 ml pipetting, 1 time, repeated 1,000 times

10 ml pipetting, 2 times, repeated 1,000 times

10 ml pipetting, 10 times, repeated 1,000 times

10 ml pipetting, 10 times, repeated 15,000 times
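
The pipetting figures can be imitated with a small simulation. This is only a sketch: the error model (a uniform error of up to ±0.5 ml per pipetting) is an assumption chosen because it is clearly not Gaussian, and the function names are illustrative. As more pipettings are added together, the share of totals within 1 SD of the mean should approach the roughly 68% expected for a Gaussian distribution.

```python
import random
import statistics


def pipetting_totals(n_pipettings, n_repeats, rng):
    """Total volume delivered when a 10 ml pipetting is done n_pipettings times.

    Each pipetting delivers 10 ml plus a uniform error of up to +/-0.5 ml
    (a hypothetical, deliberately non-Gaussian error model).
    """
    return [
        sum(10 + rng.uniform(-0.5, 0.5) for _ in range(n_pipettings))
        for _ in range(n_repeats)
    ]


def fraction_within_one_sd(values):
    """Share of values within 1 SD of their mean (about 0.68 for a Gaussian)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - mean) <= sd for v in values) / len(values)


rng = random.Random(1)  # fixed seed so the run is repeatable
results = {}
for n in (1, 2, 10):
    totals = pipetting_totals(n, 15_000, rng)
    results[n] = fraction_within_one_sd(totals)
    print(f"{n:>2} pipetting(s): fraction within 1 SD = {results[n]:.3f}")
```

A single pipetting keeps the flat shape of the assumed error; the sum of ten is already close to bell-shaped, which is the additive mechanism described for the Gaussian distribution.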

Standard Deviation (SD)
 Standard deviation (SD) quantifies the variability or scatter of
the numbers around their mean.
 SD is a number that tells you how far the numbers typically are
from their mean.
 The higher the SD, the more observations are
needed to obtain reliable results.
 For the same data, the SD computed with the sample formula (s,
dividing by n - 1) is always greater than the SD computed with the
population formula (σ, dividing by n). Why?
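
One way to explore the s versus σ question is to apply both formulas to the same numbers. The sketch below uses made-up data; the sample formula divides the sum of squared deviations by n - 1, the population formula divides by n, so the first result is necessarily the larger one.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations, for illustration only
n = len(data)
mean = statistics.fmean(data)
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

sd_population_formula = math.sqrt(sum_sq / n)      # divide by n
sd_sample_formula = math.sqrt(sum_sq / (n - 1))    # divide by n - 1

print(f"population formula: {sd_population_formula:.3f}")
print(f"sample formula:     {sd_sample_formula:.3f}")
```

The standard library gives the same two values through `statistics.pstdev(data)` and `statistics.stdev(data)`.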

 Data sets may have the same mean, yet their pattern may be
different.
 Data sets may even have the same mean and SD, yet their
pattern is different!
 The unit of SD is the same as the unit of the variable in
question.
 This makes its interpretation easier, compared to the
variance.
 What is variance, and is it useful? Variance is the SD
squared; its unit is the squared unit of the variable, so it is
rarely useful for describing data.
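
A small illustration of these points, using two made-up data sets: both have the same mean and the same SD, yet one is spread unevenly around the mean while the other sits in two tight clusters. The population formulas are used here only to keep the numbers round.

```python
import statistics

a = [2, 4, 4, 4, 5, 5, 7, 9]  # values scattered unevenly around 5
b = [3, 3, 3, 3, 7, 7, 7, 7]  # two tight clusters, nothing near 5

summary = {}
for name, data in (("a", a), ("b", b)):
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)          # same unit as the data
    variance = statistics.pvariance(data) # squared unit of the data
    summary[name] = (mean, sd, variance)
    print(f"{name}: mean = {mean}, SD = {sd}, variance = {variance}")
```

The summary numbers are identical, so only a plot of the raw values would reveal that the two patterns differ.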

Hands-on practice
 To calculate Mean and SD in Excel:
 For SD of a sample: =STDEV.S(number1,[number2],...)
 For Mean of a sample: =AVERAGE(number1,[number2],...)
 To calculate Mean and SD in SPSS:
 Analyze => Descriptive Statistics => Frequencies => Statistics =>
Mean & Standard Deviation check boxes
 Analyze => Descriptive Statistics => Descriptives => Options =>
Mean & Standard Deviation check boxes
 Analyze => Descriptive Statistics => Explore => Statistics =>
Descriptive check box
 To calculate Mean and SD in Prism:
 Analyze => Column statistics => Mean, SD, SEM check box

Standard Error of The Mean (SEM)

Individual observations (X's) and means (red dots) for random
samples from a population with a parametric mean of 5 (blue line)

Means ±1 standard deviation of 100 random samples (N=3) from a
population with a parametric mean of 5 (blue line).
Note that there are 100 means, NOT 100 observations. Each “X” represents
the mean of 3 observations.
The calculated standard deviation of the 100 sample means is 0.63.

Means ±1 standard error of 100 random samples (N=3) from a population
with a parametric mean of 5 (blue line).

Means ±1 standard error of 100 random samples (N=20) from a population
with a parametric mean of 5 (blue line).
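
The figures above can be approximated with a short simulation. This is a sketch under one loud assumption: the slides give the population mean (5) but not the population SD, so a value of 1 is assumed here. The scatter of the sample means should come out close to the population SD divided by the square root of N, and it should shrink when N grows from 3 to 20.

```python
import math
import random
import statistics

POP_MEAN = 5.0  # parametric mean from the figures
POP_SD = 1.0    # ASSUMPTION: the slides do not state the population SD


def sample_means(n_per_sample, n_samples, rng):
    """Means of n_samples random Gaussian samples, each of size n_per_sample."""
    return [
        statistics.fmean([rng.gauss(POP_MEAN, POP_SD) for _ in range(n_per_sample)])
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
spread = {}
for n in (3, 20):
    means = sample_means(n, 10_000, rng)
    spread[n] = statistics.stdev(means)
    print(f"N={n:>2}: SD of sample means = {spread[n]:.3f}, "
          f"SD / sqrt(N) = {POP_SD / math.sqrt(n):.3f}")
```

With only 100 samples, as in the figure, the observed scatter of the means will wander around this expected value rather than match it exactly.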

 The SEM quantifies how precisely you know the true
mean of the population.
 It is a measure of how far your sample mean is likely to
be from the true population mean.
 SEM = $\dfrac{SD}{\sqrt{N}}$
 The higher the SD, the less precise your estimate of
the population mean.
 The SEM is always smaller than the SD (for N > 1). Why?
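
A minimal sketch of the SEM calculation, using hypothetical measurements and an illustrative helper name. Because the SD is divided by the square root of N, the SEM has to be smaller than the SD whenever the sample holds more than one observation.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))


sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6]  # hypothetical measurements
sd = statistics.stdev(sample)
print(f"N = {len(sample)}, SD = {sd:.3f}, SEM = {sem(sample):.3f}")
```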

Hands-on practice
 To calculate SE in Excel:
 For SE of a sample: =STDEV.S(sampling range)/SQRT(COUNT(sampling
range))
 To calculate SE in SPSS:
 Analyze => Descriptive Statistics => Frequencies =>
Statistics => S.E. mean check box
 Analyze => Descriptive Statistics => Descriptives => Options
=> S.E. mean check box
 To calculate SE in Prism:
 Analyze => Column statistics => Mean, SD, SEM check box

Confidence Interval (CI)
 Statistically and mathematically CI and SEM are different,
but conceptually and practically they serve the same
purpose.
 How sure are you? The confidence interval (CI) is the way to
answer this.
 The CI of a mean tells you how precisely you have
determined the mean.
 SEM quantifies how large the difference between the
mean of the population and the mean of the sample is likely to be;
it is not itself a probability.
 CI is directly related to SEM. If the CI of a difference between
two means includes ZERO, the data are compatible with NO difference!

Each X represents the mean of a sample, and the bars represent the SEM of
that sample. The red dots and red bars mark samples whose bars do not include the
mean of the population. The blue line is the mean of the population (which
we do NOT know).

 By convention, the 95% CI is the one usually calculated.
 A 95% CI is a range that you can be 95% certain
contains the true mean of the population.
 Don't misinterpret CI as the range that contains 95%
of the values!
 Is it possible that the CI of a mean does not include
the true mean?

The graph shows three samples (of different sizes), all drawn
from the same population

Ten sets of data (N=5), from a Gaussian distribution
with a mean of 100 and a standard deviation of 35

95% CI of the mean for each sample.
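
Whether a 95% CI can miss the true mean can be checked by simulation. The sketch below reuses the setting of the example above (samples of N=5 from a Gaussian population with mean 100 and SD 35) and repeats the experiment many times. The multiplier 2.776 is the two-sided 95% critical value of Student's t for 4 degrees of freedom, taken from a standard t table; using a t-based interval is an assumption, since the slides do not say how the plotted intervals were computed.

```python
import math
import random
import statistics

POP_MEAN, POP_SD, N = 100.0, 35.0, 5
T_CRIT_DF4 = 2.776  # two-sided 95% critical value of Student's t, N - 1 = 4 df


def ci95(sample):
    """95% CI of the mean: mean +/- t * SEM (t-based interval, assumed here)."""
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - T_CRIT_DF4 * sem, mean + T_CRIT_DF4 * sem


rng = random.Random(7)  # fixed seed so the run is repeatable
n_experiments = 20_000
hits = 0
for _ in range(n_experiments):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    low, high = ci95(sample)
    hits += low <= POP_MEAN <= high  # True counts as 1

coverage = hits / n_experiments
print(f"{coverage:.1%} of the 95% CIs contain the true mean")
```

About 1 interval in 20 is expected to miss the population mean, which is exactly what "95%" refers to.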

 A common rule-of-thumb is that the 95% CI is
computed from the mean ± 2 SEMs.
 So you may roughly double the size of the SEM error
bars, to represent them as CI error bars. Why?
 With large samples the rule is accurate; with small
ones, the CI is much wider than anticipated by this
rule.
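
The rule of thumb can be checked against the exact multipliers. In this sketch the large-sample multiplier comes from the standard normal distribution, and the small-sample multipliers are two-sided 95% critical values of Student's t copied from a standard t table. The SD of 35 is a hypothetical value used only to show the effect on the CI width.

```python
import math
import statistics

# Large-sample multiplier: 97.5th percentile of the standard normal (about 1.96)
Z_95 = statistics.NormalDist().inv_cdf(0.975)

# Two-sided 95% critical values of Student's t, keyed by sample size N
# (degrees of freedom = N - 1); values copied from a standard t table.
T_95 = {3: 4.303, 5: 2.776, 10: 2.262, 20: 2.093, 30: 2.045, 100: 1.984}

SD = 35.0  # hypothetical sample SD

for n, t in T_95.items():
    sem = SD / math.sqrt(n)
    print(f"N={n:>3}: multiplier = {t:.3f}, "
          f"rule of thumb = ±{2 * sem:.1f}, t-based = ±{t * sem:.1f}")

print(f"large-sample limit: {Z_95:.3f}")
```

For N=3 the multiplier is more than twice the rule-of-thumb value of 2, while for N=100 it is already within about 1% of it.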

Because, for a 95% CI, the multiplier of the SEM is 1.96 (almost 2).
Teacher! I do NOT like formulas

 We can express the precision of any computed value
as a 95% CI (e.g., the CI of a best-fit slope, or the
CI of an SD).
 There is a myth that when two means have
overlapping CIs, the means are not significantly
different.
 Another version is: if each mean is outside the CI of
the other mean, the means are significantly different.
 Neither of these is true!

 It is easy for two sets of data to have overlapping CIs,
yet still be significantly different.
 Conversely, each mean can be outside the confidence
interval of the other, yet they're still not significantly
different.
 Do not compare two means by visually comparing
their confidence intervals; use the correct
statistical test instead.
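
Both situations can be reproduced with summary numbers alone. The sketch below makes simplifying assumptions that are not in the slides: the two groups share the same SEM, the samples are large enough for the normal multiplier 1.96, and significance is judged with a two-sided z-test at the 5% level. The means and the SEM are hypothetical.

```python
import math
import statistics

Z_95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96


def compare(mean_a, mean_b, sem):
    """Compare two means that share the same SEM (large samples assumed)."""
    half = Z_95 * sem  # half-width of each 95% CI
    ci_a = (mean_a - half, mean_a + half)
    ci_b = (mean_b - half, mean_b + half)
    diff = abs(mean_a - mean_b)
    se_diff = sem * math.sqrt(2)  # SE of the difference between the two means
    return {
        "cis_overlap": ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0],
        "each_mean_outside_other_ci": diff > half,
        "significant": diff / se_diff > Z_95,  # two-sided z-test, alpha = 0.05
    }


case_1 = compare(100.0, 103.0, sem=1.0)  # difference of 3 SEMs
case_2 = compare(100.0, 102.3, sem=1.0)  # difference of 2.3 SEMs
print("case 1:", case_1)
print("case 2:", case_2)
```

In case 1 the CIs overlap and the difference is still significant; in case 2 each mean lies outside the other CI and the difference is not significant. Under these assumptions the test depends on the SE of the difference, which is about 1.41 SEMs, not on the 2 SEMs suggested by looking at two separate error bars.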

 The error bars may be asymmetrical.
 This is especially true for counts and proportions, e.g., the
number of cigarettes smoked, or the number of color-blind
men.
 In these cases a zero or negative number makes no sense.
 We know this because there are some occurrences of the
variable in the population.
 The calculation method of CI is different if this is the
case.
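
The slides do not name the alternative calculation method. As one common choice for a proportion, the sketch below uses the Wilson score interval (an assumption, not necessarily the method the author has in mind) with hypothetical counts. The resulting interval stays above zero and is longer on the upper side than on the lower side.

```python
import math
import statistics


def wilson_ci(successes, n, confidence=0.95):
    """Wilson score interval for a proportion (one common asymmetric CI)."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


successes, n = 2, 50  # hypothetical: 2 color-blind men in a sample of 50
p_hat = successes / n
low, high = wilson_ci(successes, n)
print(f"observed proportion = {p_hat:.3f}")
print(f"95% CI = {low:.3f} to {high:.3f}")
print(f"lower arm = {p_hat - low:.3f}, upper arm = {high - p_hat:.3f}")
```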

Hands-on practice
 To calculate CI in Excel:
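 For normal distribution:
=CONFIDENCE.NORM(alpha,standard_dev,size), with alpha = 0.05 for a 95% CI;
the function returns the margin of error, so the CI is the mean ± this value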
 To calculate CI in SPSS:
 Analyze => Descriptive Statistics => Explore => Statistics =>
Descriptive check box
 Analyze => Compare means => One sample T Test =>
Options
 To calculate CI in Prism:
 Analyze => Column statistics => CI of the Mean check box

1.6: Introduction to Plots
• A plot (graph) is a graphical technique for
representing a data set.
• Graphs are a visual representation of the
variables and the relationships between
variables.
• Plots are very useful for humans, who can
quickly derive an understanding that
would not come from lists of values.

Charts
Pie Chart
[Figure: pie chart, "Distribution of the Stage of Pancreatic Cancer in Patients"; slices: stages IV, III, II, I]
Bar Chart
[Figure: bar chart, "Distribution of the Stage of Pancreatic Cancer in Patients"; x-axis: stages IV, III, II, I; y-axis: 0 to 120]
6/23/2009, Arsia Jamali, Students' Scientific Research Center

Charts
Histogram
Area

Charts
Box Plot
Error Bar

Charts
Scatter Plot
[Figure: scatter plot; x-axis: Gestational Week (0 to 50); y-axis: Birth Weight (0 to 4500)]
Clustered Bar
[Figure: clustered bar chart, "Distribution of the Stage of Pancreatic Cancer According to Age in Patients"; x-axis: Stage (IV, III, II, I; listed values 105, 80, 20, 10); y-axis: Number of Patients (0 to 70); series: Male, Female]

Significance of Cluster Bar
[Figure: bar chart, "Distribution of Stage in Pancreatic Cancer Patients"; x-axis: Stage I, Stage II, Stage III, Stage IV; y-axis: 0% to 30%]
[Figure: bar chart, "Distribution of Gender in Pancreatic Cancer Patients"; x-axis: Male, Female; y-axis: 0% to 60%]

Significance of Cluster Bar
[Figure: clustered bar chart; x-axis: Stage I, Stage II, Stage III, Stage IV; y-axis: 0.0% to 90.0%; series: Male, Female]
