3. +
Statistics
Definition
The practice or science of collecting and analyzing numerical data in large
quantities, especially for the purpose of inferring proportions in a whole from
those in a representative sample.
Descriptive statistics
Describe the basic features of data in a study
Provide summaries about the sample and measures
Inferential statistics
Investigate questions, models, and hypotheses
Infer population characteristics based on sample
Make judgments about what we observe
4. +
Some Concepts
Variable - any characteristic of an individual or entity. A variable can take
different values for different individuals. Variables can be categorical or
quantitative.
• Nominal - Categorical variables with no inherent order or ranking, such as
names or classes (e.g., gender). Values may be written as numerals but carry no numerical meaning (e.g.,
I, II, III). The only operation that can be applied to nominal variables is enumeration (counting).
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Can be
compared for equality, or greater or less, but not how much greater or less.
• Interval - Values are ordered as in Ordinal and, in addition, differences between
values are meaningful; however, the scale has no absolute zero point. Calendar
dates and temperatures on the Fahrenheit scale are examples. Addition and subtraction are
meaningful operations, but multiplication and division are not.
• Ratio - Variables with all the properties of Interval plus an absolute, non-arbitrary zero
point, e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and division
are all meaningful operations.
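As a rough sketch of how these four levels behave in practice, the R snippet below uses made-up values. The vectors `blood_type`, `severity`, and `temp_f` are illustrative assumptions, not data from these slides.

```r
# Nominal: an unordered factor; we can only count categories or test equality.
blood_type <- factor(c("A", "B", "O", "A", "AB"))
table(blood_type)

# Ordinal: an ordered factor; comparisons work, arithmetic does not.
severity <- factor(c("mild", "severe", "moderate"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)
severity[1] < severity[2]   # TRUE: mild ranks below severe

# Interval: differences are meaningful, ratios are not.
temp_f <- c(40, 80)
diff(temp_f)                # 40 degrees Fahrenheit warmer

# Absolute zero (Kelvin): ratios become meaningful.
temp_k <- (temp_f - 32) * 5 / 9 + 273.15
temp_k[2] / temp_k[1]       # about 1.08, so 80 F is not "twice as hot" as 40 F
```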
5. +
Sampling
What is your population of interest?
To whom do you want to generalize your results?
All students (18 and over)
Undergraduates only
Arts students
Athletes
Other
Can you sample the entire population?
A sample is “a smaller (but hopefully representative) collection of
units from a population used to determine truths about that
population” (Field, 2005)
Why sample?
Resources (time, money) and workload
Gives results with known accuracy that can be calculated mathematically
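A minimal sketch of the idea in R, assuming a simulated population (the ages below are generated at random and are not real data; the seed and sample size are arbitrary choices):

```r
set.seed(1)  # fixed seed so the draw is reproducible (arbitrary choice)

# Hypothetical population: ages of 10,000 students (simulated, not real data).
population <- round(runif(10000, min = 18, max = 30))

# Simple random sample of 200 students, drawn without replacement.
sampled <- sample(population, size = 200)

mean(population)  # the value we would like to know
mean(sampled)     # the estimate obtained from the sample

# Standard error of the sample mean: a calculated measure of its accuracy.
sd(sampled) / sqrt(length(sampled))
```

The sample mean should land close to the population mean, and the standard error indicates roughly how close.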
6. +
Descriptive Statistics
Descriptive statistics are used to summarize or condense a group of
scores
They include measures of central tendency and measures of
variability
Mode
Median
Mean
Range
Variance
Standard Deviation
7. +
Central Tendency
Measures of central tendency describe the average or common score
of a group of scores
Common measures of central tendency include the mean, median,
and mode
8. +
Mean
The mean is the arithmetic average of the scores
The calculation of the mean considers both the number of scores and
their value
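The formula for the mean of the variable X is:
$\bar{X} = \frac{\sum X}{n}$
where $\sum X$ is the sum of the scores and $n$ is the number of scores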
9. +
Mean
Six men with high serum cholesterol participated in a study to examine
the effects of diet on cholesterol
At the beginning of the study, their serum cholesterol levels (mg/dL)
were:
366, 327, 274, 292, 274, 230
What is the mean?
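One way to check the answer in R, using the six values from the slide (the variable name `cholesterol` is just an illustrative choice):

```r
cholesterol <- c(366, 327, 274, 292, 274, 230)  # mg/dL, from the slide

sum(cholesterol)                        # 1763
sum(cholesterol) / length(cholesterol)  # about 293.8 mg/dL
mean(cholesterol)                       # same result with the built-in function
```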
10. +
Median
The median is the middle point in an ordered distribution at which an
equal number of scores lie on each side of it
It is also known as the 50th percentile (P50), or 2nd quartile (Q2)
The position of the median (Mdn) can be calculated as follows:
$\text{position of Mdn} = \frac{n + 1}{2}$
where $n$ is the number of scores
11. +
Median
Example: Calculate the median for the following measurements for height:
71", 73", 74", 75", 72"
Step One: Arrange the scores in order: 71", 72", 73", 74", 75"
Step Two: Calculate the position of the median: (n + 1) / 2 = (5 + 1) / 2 = 3
Step Three: Determine the value of the median by counting from either the
highest or the lowest score until the desired score is reached (in this case the
3rd score, which is 73")
12. +
Median
Suppose that in our previous distribution we had a sixth score as
follows:
71”, 72”, 73”, 74”, 74”, 75”
What are the position and value of the median?
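A short R check of this case, assuming the usual position rule (n + 1) / 2. With an even number of scores the position falls between two scores, and the median is their average.

```r
heights <- c(71, 72, 73, 74, 74, 75)  # inches, already in order
n <- length(heights)

(n + 1) / 2                      # 3.5: halfway between the 3rd and 4th scores
(heights[3] + heights[4]) / 2    # 73.5
median(heights)                  # 73.5, using the built-in function
```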
13. +
Median
Consider the following example: Nine people each perform 40 sit-ups,
and one does 1,000
The median score for the group is 40, and the mean (arithmetic average)
is 136
The median would still be 40 even if the highest score were 2,000
instead of 1,000
What can you learn from this?
The Median is Unaffected by Extreme Scores
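The sit-up example can be reproduced in a few lines of R (the variable names are illustrative):

```r
situps <- c(rep(40, 9), 1000)   # nine people do 40, one does 1,000
median(situps)                  # 40
mean(situps)                    # 136

situps_extreme <- c(rep(40, 9), 2000)  # raise the extreme score to 2,000
median(situps_extreme)          # still 40
mean(situps_extreme)            # 236: the mean moves, the median does not
```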
14. +
Mode
The mode is the most frequently occurring score
Which of the following scores is the mode?
3, 7, 3, 9, 9, 3, 5, 1, 8, 5
Similarly, for another data set (2, 4, 9, 6, 4, 6, 6, 2, 8, 2), there are two
modes. What are they?
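Base R has no built-in function for the statistical mode, so the sketch below defines a small helper. The name `find_modes` is hypothetical; it simply counts each value with `table()` and keeps the most frequent ones.

```r
# Hypothetical helper: return every value that occurs most often.
find_modes <- function(x) {
  counts <- table(x)                                   # frequency of each value
  as.numeric(names(counts)[counts == max(counts)])     # value(s) with the top count
}

find_modes(c(3, 7, 3, 9, 9, 3, 5, 1, 8, 5))   # 3
find_modes(c(2, 4, 9, 6, 4, 6, 6, 2, 8, 2))   # 2 6
```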
15. +
Mode
A distribution with a single mode is said to be unimodal
A distribution with more than one mode is said to be bimodal,
trimodal, etc., or in general, multimodal
16. +
Variability
Measures of variability describe the extent of similarity or difference in
a set of scores
These measures include the range, standard deviation, and
variance
17. +
Standard Deviation (SD)
Standard Deviation – a measure of the variability, or spread, of a set
of scores around the mean
Intuitively, the sum of the differences between each score and the mean
(known as deviation scores) appears to be a good approach for
measuring variability around the mean
19. +
SD
Now let’s calculate the sum of the deviation scores for the scores 1, 2, 6, 6, 15, whose mean is 6:
$\sum (X - \bar{X})$ = (1-6) + (2-6) + (6-6) + (6-6) + (15-6)
= (-5) + (-4) + (0) + (0) + (9)
= -9 + 9 = 0
20. +
SD
We can avoid this problem (deviation scores sum to 0) by
squaring each deviation score before summing them
This would be written symbolically as $\sum (X - \bar{X})^2$
Substituting our X scores again,
= (1-6)^2 + (2-6)^2 + (6-6)^2 + (6-6)^2 + (15-6)^2
= (-5)^2 + (-4)^2 + (0)^2 + (0)^2 + (9)^2
= 25 + 16 + 0 + 0 + 81
= 122
21. +
SD
We then divide this value by n-1 to arrive at the mean squared
deviation (the sample variance)
122/4 = 30.5
We then take the square root of this value to bring the units
back to the raw score units
$s = \sqrt{30.5} \approx 5.52$
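The whole calculation can be traced step by step in R, using the five scores from the worked example (variable names are illustrative):

```r
scores <- c(1, 2, 6, 6, 15)

deviations <- scores - mean(scores)        # -5 -4  0  0  9
sum(deviations)                            # 0: raw deviations always cancel out
sum(deviations^2)                          # 122: sum of squared deviations

mean_sq_dev <- sum(deviations^2) / (length(scores) - 1)  # 30.5
sqrt(mean_sq_dev)                          # about 5.52
sd(scores)                                 # same value from the built-in function
```

R's `sd()` also divides by n - 1, which is why the two results agree.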
22. +
Variance
The variance is the square of the standard deviation
It is used most commonly with more advanced statistical procedures
such as regression analysis, analysis of variance (ANOVA), and the
determination of the reliability of a test
The variance shows how far, on average, the values in the data set lie from
the mean. Here is how it is defined:
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
23. +
Example calculation of variance and standard deviation on strength scores.
Subj   Score (X)   Deviation (X - Mean)   Squared deviation
1      216          22.7                    515.29
2      144         -49.3                   2430.49
3      183         -10.3                    106.09
4      138         -55.3                   3058.09
5      212          18.7                    349.69
6      180         -13.3                    176.89
7      200           6.7                     44.89
8      264          70.7                   4998.49
9      203           9.7                     94.09
Sum    1740          0                    11774.01
$\bar{X} = \frac{\sum X}{n} = \frac{1740}{9} = 193.3$
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{11774.01}{8} = 1471.75$
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{1471.75} = 38.4$
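The strength-score figures can be verified in R with the nine scores from the table (the variable name `strength` is an illustrative choice):

```r
strength <- c(216, 144, 183, 138, 212, 180, 200, 264, 203)

sum(strength)                         # 1740
mean(strength)                        # about 193.3
sum((strength - mean(strength))^2)    # about 11774
var(strength)                         # about 1471.75 (divides by n - 1)
sd(strength)                          # about 38.4
```

The sum of squares printed by R (11774) differs slightly from the table's 11774.01 because the table rounds each deviation to one decimal place before squaring.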
24. +
Range
The range of a set of data is the difference between the highest
and lowest values in the set. To find the range, first order the
data from least to greatest. Then subtract the smallest value
from the largest value in the set.
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.
So the range is 9-3 = 6.
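In R, note that `range()` returns the minimum and maximum rather than their difference, so the range as defined here needs one extra step. A small sketch with the example values:

```r
values <- c(4, 6, 9, 3, 7)

range(values)               # 3 9: the lowest and highest values
diff(range(values))         # 6: the range as defined above
max(values) - min(values)   # 6: the same calculation written out
```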
25. +
Quantiles
A quantile is one of a class of values of a variable that divide
the total frequency of a sample or population into a
given number of equal proportions
Examples:
Percentile
Decile
Quartile
Quintile
26. +
Quantiles
The 100-quantiles are called percentiles.
The 10-quantiles are called deciles.
The 5-quantiles are called quintiles.
The 4-quantiles are called quartiles.
27. +
Percentiles
The kth percentile is a scale value for a data series equal to the
k/100 quantile
The 1st percentile cuts off the lowest 1% of data
The 98th percentile cuts off the lowest 98% of data
The 25th percentile is the first quartile
The 50th percentile is the median
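A quick illustration in R, assuming a toy data series of the integers 0 to 100 so that each percentile lands on a round number. R's `quantile()` supports several estimation methods through its `type` argument; the default is used here, and other software may give slightly different values on small data sets.

```r
x <- 0:100   # toy data: 101 evenly spaced values

# 1st, 25th, 50th, and 98th percentiles; approximately 1, 25, 50, and 98 here.
quantile(x, probs = c(0.01, 0.25, 0.50, 0.98))

median(x)    # 50, matching the 50th percentile
```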
28. +
Deciles
A decile is each of ten equal groups into which a population can be
divided according to the distribution of values of a particular
variable; the term is also used for the nine values that separate those groups.
Each group represents 1/10 of the total population
The 1st decile cuts off the lowest 10% of data
The 9th decile cuts off the lowest 90% of data
29. +
Quartiles
The quartiles divide the distribution into four
equal parts, called fourths
Each part holds 25% of the data, with cut points at the 25th,
50th, and 75th percentiles
The lower quartile (Q1) is the 25th percentile (0.25)
The middle quartile (Q2) is the 50th percentile (0.50), which is the median
The upper quartile (Q3) is the 75th percentile (0.75)
30. +
Quintiles
A quintile is any of five equal groups into which a population can be divided
according to the distribution of values of a particular variable.
Each group represents 20%, or 1/5, of the population
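Quartiles, deciles, and quintiles are all special cases of the same calculation, which the sketch below shows by changing only the `probs` argument of R's `quantile()`. The data are a toy series (integers 0 to 100) chosen so the cut points are easy to read, and the default quantile method is assumed.

```r
x <- 0:100   # toy data: 101 evenly spaced values

quantile(x, probs = (1:3) / 4)    # quartile cut points: about 25, 50, 75
quantile(x, probs = (1:4) / 5)    # quintile cut points: about 20, 40, 60, 80
quantile(x, probs = (1:9) / 10)   # decile cut points: about 10, 20, ..., 90
```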
31. +
Box Plot
A visual tool that illustrates the distribution of a
univariate dataset.
It illustrates the median, upper and lower
quartiles, whiskers (in some variants drawn at the upper and lower deciles), and any
outliers.
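Using R
dataset <- c(4, 6, 9, 3, 7)   # example values; replace with your own numeric vector
boxplot(dataset)              # draw the box plot
quantile(dataset)             # minimum, quartiles, and maximum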
33. +
Practice
One hundred randomly selected students were asked how many movies they watched
the previous week. The results are as follows:
# of Movies   Frequency
0             20
1             36
2             24
3             16
4             4
Find the sample mean, median, and range.
Find the standard deviation and the variance.
Construct a bar plot of the data.
Find the first quartile.
Find the second quartile. Which measure of central tendency does it correspond to?
Find the third quartile.
Construct a box plot of the data.
What percent of the students saw fewer than three movies?
Find the 40th percentile.
Find the 90th percentile.
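One possible way to get started in R is to expand the frequency table into a vector of 100 individual responses. The sketch below only sets up the data and leaves the exercise itself to the reader; the variable name `movies` is an illustrative choice.

```r
# Repeat each movie count as many times as its frequency in the table.
movies <- rep(0:4, times = c(20, 36, 24, 16, 4))

length(movies)   # 100 students
table(movies)    # reproduces the frequency table as a check
```

From here, functions such as `mean()`, `median()`, `sd()`, `var()`, `quantile()`, `barplot(table(movies))`, and `boxplot(movies)` can be applied to `movies`.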