Introduction to statistics RSS6 2014

Introduction to Statistics
Amr Albanna, MD, MSc

Content
• Scales of Measurement
– Categorical Variables
– Numerical Variables
• Displays of Categorical Data
– Frequencies
– Bar Graph
– Pie Chart
• Numerical Measures of Central Tendency
– Mean
– Median
– Mode
• Numerical Measures of Spread
• Association
• Correlation
• Regression

Scales of Measurement
• Categorical Variables:
– Nominal: Categorical variable with no order (e.g. blood
type: A, B, AB or O).
– Ordinal: Categorical variable with an order (e.g. pain: “none”,
“mild”, “moderate”, or “severe”).
• Numerical Variables:
– Interval: Quantitative data where differences are
meaningful but ratios are not (e.g. calendar years such as
2009 and 2010).
– Ratio: Quantitative data where ratios are also meaningful (e.g.
weights: 200 lbs is twice as heavy as 100 lbs).

Categorical Variables
• Displays of Categorical Data
– Frequencies
– Bar Graph
– Pie Chart

Categorical Variables
Variable (Sex)  Frequency  Proportion
Male            609        0.61
Female          391        0.39
Total           1000       1.00
[Figure: bar graph of frequency (axis from 0 to 700) by sex (Male, Female), and a pie chart of the same data.]
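The frequency table can be reproduced with a short script. This is a minimal sketch: the list `sex` is a hypothetical raw data set rebuilt from the counts on the slide (609 males, 391 females), not the original data.

```python
from collections import Counter

# Hypothetical raw data rebuilt from the slide's counts.
sex = ["Male"] * 609 + ["Female"] * 391

counts = Counter(sex)                      # frequency of each category
total = sum(counts.values())               # total number of observations
proportions = {k: v / total for k, v in counts.items()}

# Print one row per category, then the totals row.
for category, freq in counts.items():
    print(f"{category:<8}{freq:>6}{proportions[category]:>8.2f}")
print(f"{'Total':<8}{total:>6}{sum(proportions.values()):>8.2f}")
```

The proportions always sum to 1, which is a quick check on any frequency table.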

Numerical Variables
Central Tendency
Numerical Spread

Measures of Central Tendency
• The 3 M's
– Mean
– Median
– Mode

Measures of Central Tendency
Sample Mean
The sample mean, x̄, is the sum of all values in the
sample divided by the total number of observations,
n, in the sample:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Example: Sample Mean
• Mean systolic blood pressure (BP)
Scenario 1:
Subject  BP
1        120 (x1)
2        135 (x2)
3        115 (x3)
4        110 (x4)
5        105 (x5)
6        140 (x6)
Mean = (120 + 135 + 115 + 110 + 105 + 140)/6 = 725/6 ≈ 121
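The definition of the sample mean translates directly into code. The sketch below uses a helper named `sample_mean` (a name chosen here for illustration) on the six Scenario 1 readings.

```python
def sample_mean(values):
    """Sum of all values divided by the number of observations n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean of an empty sample is undefined")
    return sum(values) / n


bp = [120, 135, 115, 110, 105, 140]   # systolic BP, Scenario 1
mean_bp = sample_mean(bp)
print(round(mean_bp, 1))              # 725/6, which rounds to 121
```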

Sample Mean
• The mean is affected by extreme observations
and is not a resistant measure.
Scenario 2:
Subject  BP
1        120 (x1)
2        135 (x2)
3        115 (x3)
4        110 (x4)
5        105 (x5)
6        140 (x6)
7        280 (x7)
Mean = (120 + 135 + 115 + 110 + 105 + 140 + 280)/7 = 1005/7 ≈ 144
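To see how much a single extreme reading moves the mean, the two scenarios can be compared side by side. This is a small sketch using the standard-library `statistics.mean`; the variable names are illustrative.

```python
from statistics import mean

scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]       # same data plus one extreme reading

mean_1 = mean(scenario_1)
mean_2 = mean(scenario_2)
shift = mean_2 - mean_1               # how far one value moved the mean

print(f"Scenario 1 mean: {mean_1:.1f}")
print(f"Scenario 2 mean: {mean_2:.1f}")
print(f"Shift caused by the extreme value: {shift:.1f}")
```

One added observation out of seven shifts the mean by more than 20 units, which is what "not resistant" means in practice.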

Median
• The sample median, M, is the number such
that “half” the values in the sample are
smaller and the other “half” are larger.
• Use the following steps to find M.
– Sort the data (arrange in increasing order).
– Is the size of the data set n even or odd?
– If odd: M = value in the exact middle.
– If even: M = the average of the two middle
numbers.
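The steps above can be sketched as a small function. The name `sample_median` is chosen here for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def sample_median(values):
    """Median via the sort-then-pick-the-middle steps."""
    ordered = sorted(values)          # arrange in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd n: the value in the exact middle
        return ordered[mid]
    # even n: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([3, 1, 2]))       # odd-sized sample
print(sample_median([4, 1, 3, 2]))    # even-sized sample
```

Sorting first is essential: picking the middle of the unsorted list gives a value that depends on the order in which the data were recorded.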

Example: Sample Median
• Median systolic BP:
Scenario 1:
Data: 120, 135, 115, 110, 105, 140
Sorted: 105, 110, 115, 120, 135, 140
Median = (115 + 120)/2 = 117.5
Scenario 2:
Data: 120, 135, 115, 110, 105, 140, 280
Sorted: 105, 110, 115, 120, 135, 140, 280
Median = 120
• The median is barely affected by extreme
observations and is a resistant measure.
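The blood pressure medians can be checked with the standard-library `statistics.median`, which sorts the data internally. The sketch below also prints the sorted readings so the middle values are visible.

```python
from statistics import median

scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]       # one extreme reading added

print(sorted(scenario_1))             # middle two values are 115 and 120
print(sorted(scenario_2))             # middle value is 120

median_1 = median(scenario_1)
median_2 = median(scenario_2)
print(median_1, median_2)
```

The extreme reading of 280 moves the median by only 2.5 units, compared with a shift of more than 20 units in the mean.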

Mode
• The sample mode is the value that occurs
most frequently in the sample (a data set can
have more than one mode).
• This is the only measure of center which can
also be used for categorical data.
• The population mode is the highest point on
the population distribution.
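A short sketch of the sample mode, using `statistics.multimode` so that ties are reported rather than hidden. Both data sets are made up for illustration.

```python
from statistics import multimode

blood_types = ["A", "O", "B", "O", "AB", "A", "O"]   # made-up categorical sample
readings = [110, 120, 120, 135, 135, 140]            # made-up numerical sample

modes_categorical = multimode(blood_types)   # works on non-numeric data
modes_numeric = multimode(readings)          # returns every value tied for most frequent

print(modes_categorical)
print(modes_numeric)
```

The second sample has two modes, which illustrates that a data set can have more than one.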

Symmetric Data Distribution
[Figure: histogram of a symmetric distribution; x-axis: Value (10 to 50), y-axis: Frequency (0 to 6).]

Rightward Skewness of Data
[Figure: histogram of right-skewed data; x-axis: Value (10 to 50), y-axis: Frequency (0 to 6). Marked from left to right: Mode, Median, Mean.]

Leftward Skewness of Data
[Figure: histogram of left-skewed data; x-axis: Value (10 to 50), y-axis: Frequency (0 to 6). Marked from left to right: Mean, Median, Mode.]
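The ordering of the three measures under skewness can be illustrated numerically. The samples below are made up for this sketch; the left-skewed one is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median, mode

# Made-up right-skewed sample: most values are low, with a tail to the right.
right_skewed = [10, 10, 10, 20, 20, 30, 40, 50]
# Mirror the sample around 60 to obtain a left-skewed one.
left_skewed = [60 - x for x in right_skewed]

right = (mode(right_skewed), median(right_skewed), mean(right_skewed))
left = (mean(left_skewed), median(left_skewed), mode(left_skewed))

print("Right skew (mode, median, mean):", right)
print("Left skew  (mean, median, mode):", left)
```

In both cases the mean is pulled furthest toward the long tail, the mode stays at the peak, and the median lies between them.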

Numerical Measures of Spread
• Range
• Sample Variance
• Interquartile Range (IQR)

• Range: The range of a data set is the
difference between the highest value and the
lowest value.
– Range = highest value - lowest value
– Easy to compute, but ignores a great deal of
information.
– The range is affected by extreme
observations and is not a resistant measure.
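A minimal sketch of the range, applied to the blood pressure readings with and without the extreme value of 280. The helper name `sample_range` is chosen here for illustration.

```python
def sample_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]       # one extreme reading added

range_1 = sample_range(scenario_1)
range_2 = sample_range(scenario_2)
print(range_1, range_2)
```

Because the range uses only the two most extreme values, a single unusual reading can change it drastically.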

• Variance: equal to the sum of squared deviations
from the sample mean divided by n - 1, where n is
the number of observations in the sample.
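The definition of the sample variance can be written out step by step. The sketch below uses an illustrative helper, `sample_variance`, and compares it against the standard-library `statistics.variance`, which also divides by n - 1.

```python
from statistics import variance


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    x_bar = sum(values) / n                              # sample mean
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


bp = [120, 135, 115, 110, 105, 140]   # systolic BP, Scenario 1
s2 = sample_variance(bp)
print(round(s2, 2))
print(round(variance(bp), 2))         # library result for comparison
```

The square root of this quantity is the sample standard deviation, which is expressed in the same units as the data.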

• Percentile: The pth percentile of a distribution is
the value such that p percent of the observations
fall at or below it.

• The most commonly used percentiles are the
quartiles.
1st quartile Q1 = 25th percentile.
2nd quartile Q2 = 50th percentile.
3rd quartile Q3 = 75th percentile.

Interquartile Range (IQR)
A simple measure of spread, giving the range covered
by the middle half of the data, is the interquartile
range (IQR), defined below.
IQR = Q3 - Q1
The IQR is a resistant measure of spread.
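Quartiles and the IQR can be computed with `statistics.quantiles`. Note that several conventions exist for interpolating quartiles in small samples and the slides do not say which one is intended; the choice of `method="inclusive"` here is an assumption, and other software may give slightly different values.

```python
from statistics import quantiles

bp = [120, 135, 115, 110, 105, 140]   # systolic BP, Scenario 1

# n=4 splits the data into four parts and returns the three cut points.
# method="inclusive" is an assumed convention, not one stated in the slides.
q1, q2, q3 = quantiles(bp, n=4, method="inclusive")
iqr = q3 - q1                         # range covered by the middle half

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("IQR =", iqr)
```

Q2 is the median, so it matches the value obtained by sorting the data and averaging the two middle readings.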

• Outliers: extreme observations that fall well
outside the overall pattern of the distribution.
• An outlier may be the result of:
– A recording error
– An observation from a different population
– An unusual extreme observation (biological
diversity)
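The slides do not give a numerical rule for flagging outliers. One common convention, used here purely as an assumption, is to flag values more than 1.5 × IQR below Q1 or above Q3. The helper name `flag_outliers`, the multiplier, and the quartile method are all illustrative choices.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed 1.5 x IQR rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


scenario_2 = [120, 135, 115, 110, 105, 140, 280]
outliers = flag_outliers(scenario_2)
print(outliers)
```

A flagged value is a prompt to investigate, not an automatic reason to delete it: it may be a recording error, or it may be a genuine extreme observation.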

Association Between Variables
• Explanatory (exposure) variable “X”
• Response (outcome) variable “Y”

Correlation is NOT Association
