Descriptives & Graphing

Lecture 3
Survey Research & Design in Psychology
James Neill, 2017
Creative Commons Attribution 4.0
Image source: http://commons.wikimedia.org/wiki/File:3D_Bar_Graph_Meeting.jpg

2
Overview:
Descriptives & Graphing
1. Getting to know a data set
2. LOM & types of statistics
3. Descriptive statistics
4. Normal distribution
5. Non-normal distributions
6. Effect of skew on central tendency
7. Principles of graphing
8. Univariate graphical techniques

3
Getting to know
a data-set
(how to approach data)

4
Play with the data –
get to know it.
Image source: http://www.flickr.com/photos/analytik/1356366068/

5
Don't be afraid - you
can't break data!
Image source: http://www.flickr.com/photos/rnddave/5094020069

6
Check & screen the data –
keep signal, reduce noise
Image source: https://commons.wikimedia.org/wiki/File:Nasir-al_molk_-1.jpg

7
Data checking: One
person reads the survey responses
aloud to another person who checks
the electronic data file.
For large studies, check a
proportion of the surveys
and declare the error-rate
in the research report.
Image source: http://maxpixel.freegreatpicture.com/Business-Team-Two-People-Meeting-Computers-Office-1209640

8
Data screening: Carefully
'screening' a data file helps to
remove errors and maximise validity.
For example, screen for:
Out of range values
Mis-entered data
Missing cases
Duplicate cases
Missing data
Image source: https://commons.wikimedia.org/wiki/File:Archaeology_dirt_screening.jpg
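The screening checks listed above can be sketched in a few lines of code. This is a minimal illustration, not a prescribed procedure: the records, the variable names, and the valid ranges (age 18-99, rating 1-5) are all hypothetical.

```python
# Hypothetical survey records: id, age (assumed valid range 18-99), rating (1-5).
records = [
    {"id": 1, "age": 23, "rating": 4},
    {"id": 2, "age": 230, "rating": 5},    # mis-entered age (out of range)
    {"id": 3, "age": 31, "rating": None},  # missing data
    {"id": 3, "age": 31, "rating": None},  # duplicate case
    {"id": 4, "age": 45, "rating": 7},     # rating outside 1-5
]

VALID_RANGES = {"age": (18, 99), "rating": (1, 5)}  # assumed valid ranges


def screen(rows):
    """Return a list of (id, problem) tuples found in the data."""
    problems = []
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            problems.append((row["id"], "duplicate case"))
            continue
        seen_ids.add(row["id"])
        for var, (low, high) in VALID_RANGES.items():
            value = row[var]
            if value is None:
                problems.append((row["id"], f"missing {var}"))
            elif not low <= value <= high:
                problems.append((row["id"], f"{var} out of range: {value}"))
    return problems


for case_id, problem in screen(records):
    print(case_id, problem)
```

Flagged cases still need a human decision (correct from the paper survey, recode as missing, or delete).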

9
Explore the data
Image source: https://commons.wikimedia.org/wiki/File:Kazimierz_Nowak_in_jungle_2.jpg

10
Get intimate
with the data
Image source: http://www.flickr.com/photos/elmoalves/2932572231/

11
Describe the data's
main features:
find a meaningful, accurate way to
depict the ‘true story’ of the data
Image source: http://www.flickr.com/photos/lloydm/2429991235/

12
Test hypotheses
to answer research questions
Image source: https://pixabay.com/en/light-bulb-current-light-glow-1042480/

13
Level of
measurement &
types of statistics
Image source: http://www.flickr.com/photos/peanutlen/2228077524/

14
Golden rule of data analysis
A variable's level of
measurement determines the
type of statistics that can be
used, including types of:
• descriptive statistics
• graphs
• inferential statistics

15
Levels of measurement and
non-parametric vs. parametric
Categorical & ordinal DVs
→ non-parametric
(does not assume a normal distribution)
Interval & ratio DVs
→ parametric
(assumes a normal distribution)
→ non-parametric
(if the distribution is non-normal)
DVs = dependent variables

16
Parametric statistics
• Statistics which estimate
parameters of a population, based
on the normal distribution
–Univariate:
mean, standard deviation, skewness,
kurtosis, t-tests, ANOVAs
–Bivariate:
correlation, linear regression
–Multivariate:
multiple linear regression

17
Parametric statistics
• More powerful
(more sensitive)
• More assumptions
(population is normally distributed)
• Vulnerable to violations of
assumptions
(less robust)

18
Non-parametric statistics
• Statistics which do not assume
sampling from a population which
is normally distributed
–There are non-parametric alternatives for
many parametric statistics
–e.g., sign test, chi-squared, Mann-
Whitney U test, Wilcoxon matched-pairs
signed-ranks test.

19
Non-parametric statistics
• Less powerful
(less sensitive)
• Fewer assumptions
(do not assume a normal distribution)
• Less vulnerable to assumption
violation
(more robust)

20
Univariate
descriptive
statistics

21
Number of variables
Univariate
= one variable
Bivariate
= two variables
Multivariate
= more than two variables
mean, median, mode,
histogram, bar chart
correlation, t-test,
scatterplot, clustered bar
chart
reliability analysis, factor
analysis, multiple linear
regression

22
What do we want to describe?
The distributional properties of
variables, based on:
● Central tendency(ies): e.g.,
frequencies, mode, median, mean
● Shape: e.g., skewness, kurtosis
● Spread (dispersion): min., max.,
range, IQR, percentiles, variance,
standard deviation

23
Measures of central tendency
Statistics which represent the
‘centre’ of a frequency distribution:
–Mode (most frequent)
–Median (50th percentile)
–Mean (average)
Which ones to use depends on:
–Type of data (level of measurement)
–Shape of distribution (esp. skewness)
Reporting more than one may be
appropriate.

24
Measures of central tendency
            Mode / Freq. / %s   Median          Mean
Nominal     √                   x               x
Ordinal     √                   If meaningful   x
Interval    √                   √               √
Ratio       If meaningful       √               √

25
Measures of distribution
• Measures of shape, spread,
dispersion, and deviation from the
central tendency
Non-parametric:
• Min. and max.
• Range
• Percentiles
Parametric:
• SD
• Skewness
• Kurtosis

26
Measures of spread / dispersion / deviation
            Min / Max, Range   Percentile      Var / SD
Nominal     x                  x               x
Ordinal     √                  If meaningful   x
Interval    √                  √               √
Ratio       √                  √               √

27
Descriptives for nominal data
• Nominal LOM = Labelled categories
• Descriptive statistics:
–Most frequent? (Mode – e.g., females)
–Least frequent? (e.g., males)
–Frequencies (e.g., 20 females, 10 males)
–Percentages (e.g., 67% females, 33% males)
–Cumulative percentages
–Ratios (e.g., twice as many females as males)
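As a rough sketch, the nominal descriptives above can be produced with a frequency count. The data simply reproduce the 20 females / 10 males example from this slide; the variable names are illustrative.

```python
from collections import Counter

responses = ["female"] * 20 + ["male"] * 10  # the 20 / 10 split used above

counts = Counter(responses)
n = sum(counts.values())
mode_category, mode_count = counts.most_common(1)[0]

# Frequency table with percentages and cumulative percentages.
cumulative = 0.0
for category, f in counts.most_common():
    pct = 100 * f / n
    cumulative += pct
    print(f"{category:<8} f = {f:>2}  % = {pct:5.1f}  cumulative % = {cumulative:5.1f}")

print("Mode:", mode_category)
print("Ratio female:male =", counts["female"] / counts["male"])
```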

28
Descriptives for ordinal data
• Ordinal LOM = Conveys order but
not distance (e.g., ranks)
• Descriptives approach is as for
nominal (frequencies, mode etc.)
• Plus percentiles (including median)
may be useful

29
Descriptives for interval data
• Interval LOM = order and
distance, but no true 0 (0 is
arbitrary).
• Central tendency (mode, median,
mean)
• Shape/Spread (min., max., range,
SD, skewness, kurtosis)
Interval data is discrete, but is often treated as
ratio/continuous (especially for > 5 intervals)

30
Descriptives for ratio data
• Ratio = Numbers convey order
and distance, meaningful 0 point
• As for interval, use median,
mean, SD, skewness etc.
• Can also use ratios (e.g., Category A is
twice as large as Category B)

31
Mode (Mo)
• Most common score - highest point in a
frequency distribution – a real score – the most
common response
• Suitable for all levels of data, but
may not be appropriate for ratio (continuous)
• Not affected by outliers
• Check frequencies and bar graph
to see whether it is an accurate
and useful statistic

32
Frequencies (f) and
percentages (%)
• # of responses in each category
• % of responses in each category
• Frequency table
• Visualise using a bar or pie chart

33
Median (Mdn)
• Mid-point of distribution
(Quartile 2, 50th percentile)
• Not badly affected by outliers
• May not represent the central
tendency in skewed data
• If the Median is useful, then
consider what other percentiles
may also be worth reporting

34
Summary: Descriptive statistics
• Level of measurement and
normality determine whether
data can be treated as parametric
• Describe the central tendency
–Frequencies, Percentages
–Mode, Median, Mean
• Describe the variability:
–Min., Max., Range, Quartiles
–Standard Deviation, Variance

35
Properties of the
normal distribution
Image source: http://www.flickr.com/photos/trevorblake/3200899889/

36
Four moments of a
normal distribution
[Figure: distribution annotated with Mean, ←SD→, -ve Skew, +ve Skew, and ←Kurtosis→]

37
Four moments of a
normal distribution
Four mathematical qualities
(parameters) can describe a
continuous distribution which at least
roughly follows a bell curve shape:
• 1st = mean (central tendency)
• 2nd = SD (dispersion)
• 3rd = skewness (lean / tail)
• 4th = kurtosis (peakedness / flatness)

38
Mean
(1st moment)
• Average score
Mean = Σ X / N
• For normally distributed ratio or
interval (if treating it as continuous) data.
• Influenced by extreme scores
(outliers)

39
Beware inappropriate averaging...
With your head in an oven
and your feet in ice
you would feel,
on average,
just fine
The majority of people have more
than the average number of legs
(M = 1.9999).
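The legs example can be checked with a little arithmetic. A minimal sketch, assuming a hypothetical population of 10,000 people in which exactly one person has one leg:

```python
import statistics

# Hypothetical population: 9,999 people with two legs, one person with one leg.
legs = [2] * 9999 + [1]

mean_legs = statistics.mean(legs)
median_legs = statistics.median(legs)
above_mean = sum(1 for x in legs if x > mean_legs)

print("Mean:", mean_legs)
print("Median:", median_legs)
print("People above the mean:", above_mean, "of", len(legs))
```

Here the median (2) describes the typical person better than the mean, because a single extreme score pulls the mean below almost everyone's actual value.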

40
Standard deviation
(2nd moment)
• SD = square root of the variance:
$$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$
• For normally distributed interval or
ratio data
• Affected by outliers
• Can also derive the Standard Error
(SE) = SD / square root of N
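The SD and SE formulas above can be worked through by hand. This sketch uses a small hypothetical sample and cross-checks the result against the standard library's sample standard deviation, which also divides by N - 1.

```python
import math
import statistics

scores = [4, 8, 6, 5, 3, 7, 9, 6]  # hypothetical sample
n = len(scores)
mean = sum(scores) / n

# Sample variance: sum of squared deviations from the mean, divided by N - 1.
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
sd = math.sqrt(variance)

# Standard error of the mean: SD divided by the square root of N.
se = sd / math.sqrt(n)

print(f"M = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
print("Matches statistics.stdev:", math.isclose(sd, statistics.stdev(scores)))
```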

41
Skewness
(3rd moment)
• Lean of distribution
– +ve = tail to right
– -ve = tail to left
• Can be caused by an outlier, or
ceiling or floor effects
• Can be accurate (e.g., cars owned per person
would have a skewed distribution)

42
Skewness (3rd moment)
(with ceiling and floor effects)
● Negative skew
● Ceiling effect
● Positive skew
● Floor effect
Image source http://www.visualstatistics.net/Visual%20Statistics%20Multimedia/normalization.htm

43
Kurtosis
(4th moment)
• Flatness or peakedness of
distribution
+ve = peaked
-ve = flattened
• By altering the X &/or Y axis, any
distribution can be made to look
more peaked or flat – add a normal
curve to help judge kurtosis visually.

44
Kurtosis
(4th moment)
Image source: https://classconnection.s3.amazonaws.com/65/flashcards/2185065/jpg/kurtosis-142C1127AF2178FB244.jpg

45
Judging severity of
skewness & kurtosis
• View histogram with normal curve
• Deal with outliers
• Rule of thumb:
Skewness and kurtosis between -1 and 1 are
generally considered sufficiently normal
for meeting the assumptions of parametric
inferential statistics
• Significance tests of skewness:
Tend to be overly sensitive
(therefore avoid using)
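The rule of thumb can be sketched in code. This example uses the simple moment-based formulas for skewness and excess kurtosis; statistical packages such as SPSS apply small-sample adjustments, so their values will differ slightly. The two data sets are hypothetical.

```python
def skewness_and_kurtosis(data):
    """Moment-based skewness and excess kurtosis (no small-sample adjustment)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def within_rule_of_thumb(data):
    """True if both skewness and kurtosis fall between -1 and 1."""
    skew, kurt = skewness_and_kurtosis(data)
    return abs(skew) < 1 and abs(kurt) < 1


symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]      # hypothetical, roughly bell-shaped
skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 20]     # hypothetical, long right tail

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    skew, kurt = skewness_and_kurtosis(data)
    print(f"{name}: skewness = {skew:.2f}, kurtosis = {kurt:.2f}, "
          f"within rule of thumb = {within_rule_of_thumb(data)}")
```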

46
Areas under the normal curve
If distribution is normal
(bell-shaped - or close):
~68% of scores within +/- 1 SD of M
~95% of scores within +/- 2 SD of M
~99.7% of scores within +/- 3 SD of M
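These percentages can be verified from the cumulative distribution function of a standard normal curve. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: M = 0, SD = 1

areas = {}
for k in (1, 2, 3):
    # Area between -k SD and +k SD from the mean.
    areas[k] = z.cdf(k) - z.cdf(-k)
    print(f"Within +/- {k} SD of M: {100 * areas[k]:.1f}%")
```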

47
Areas under the normal curve
Image source: https://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG

49
Types of non-normal distribution
• Modality
–Uni-modal (one peak)
–Bi-modal (two peaks)
–Multi-modal (more than two peaks)
• Skewness
–Positive (tail to right)
–Negative (tail to left)
• Kurtosis
–Platykurtic (Flat)
–Leptokurtic (Peaked)

51
Histogram of people's weight
[Figure: histogram of WEIGHT (X-axis, 40.0 to 110.0) by Frequency (Y-axis); Mean = 69.6, Std. Dev = 17.10, N = 20]

52
Histogram of daily calorie intake
N = 75

54
Example ‘normal’ distribution 1
[Figure: histogram of responses to "At what age do you think you will die?" (X-axis, 0 to 140) by Frequency (Y-axis); Mean = 81.21, Std. Dev. = 18.228, N = 188]

55
Example ‘normal’ distribution 2
[Figure: bar chart of Count by Femininity-Masculinity (Very masculine, Fairly masculine, Androgynous, Fairly feminine, Very feminine)]
This bimodal graph
actually consists of two
different, underlying
normal distributions.

56
[Figure: Count by Femininity-Masculinity, split into two panels: Distribution for females (Gender: female) and Distribution for males (Gender: male)]

57
Non-normal distribution:
Use non-parametric descriptive statistics
• Min. & Max.
• Range = Max. - Min.
• Percentiles
• Quartiles
–Q1
–Mdn (Q2)
–Q3
–IQR (Q3-Q1)
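A short sketch of these non-parametric descriptives, using hypothetical positively skewed scores. Quartile values depend on the interpolation method, and different packages use different defaults; the "inclusive" method here is just one reasonable choice.

```python
import statistics

scores = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 30]  # hypothetical, positively skewed

minimum, maximum = min(scores), max(scores)
score_range = maximum - minimum

# Cut the distribution into four equal parts: Q1, Mdn (Q2), Q3.
q1, mdn, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(f"Min = {minimum}, Max = {maximum}, Range = {score_range}")
print(f"Q1 = {q1}, Mdn = {mdn}, Q3 = {q3}, IQR = {iqr}")
```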

58
Effects of skew on measures
of central tendency
+vely skewed distributions
mode < median < mean
symmetrical (normal) distributions
mean = median = mode
-vely skewed distributions
mean < median < mode
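The orderings above can be demonstrated with a small hypothetical data set and its mirror image:

```python
import statistics

# Hypothetical positively skewed data (e.g., cars owned per household).
positive = [0, 1, 1, 1, 1, 2, 2, 2, 3, 4, 6]
# Mirror image of the same data, which is negatively skewed.
negative = [6 - x for x in positive]

for name, data in (("+ve skew", positive), ("-ve skew", negative)):
    mode = statistics.mode(data)
    median = statistics.median(data)
    mean = statistics.mean(data)
    print(f"{name}: mode = {mode}, median = {median}, mean = {mean:.2f}")
```

The mean is pulled furthest towards the tail, the mode stays at the peak, and the median falls in between.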

59
Effects of skew on measures of
central tendency

60
Transformations
• Converts data using various
formulae to achieve normality
and allow more powerful tests
• Loses original metric
• Complicates interpretation
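As one illustration, a natural log transformation often reduces positive skew. The sketch below assumes hypothetical scores that are all greater than zero (logs are undefined otherwise) and uses the simple moment-based skewness formula.

```python
import math


def skewness(data):
    """Moment-based skewness (no small-sample adjustment)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


raw = [1, 1, 1, 1, 2, 2, 3, 5, 9, 20]   # hypothetical, strongly positively skewed
logged = [math.log(x) for x in raw]      # natural log transformation

print(f"Skewness before: {skewness(raw):.2f}")
print(f"Skewness after:  {skewness(logged):.2f}")
```

The transformed scores are now in log units rather than the original metric, which is the interpretive cost noted above.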

61
Review questions
1. If a survey question produces a
‘floor effect’, where will the mean,
median and mode lie in relation to
one another?

62
Review questions
2. Would you expect the mean # of cars
owned in Australia to exceed the median?

63
Review questions
3. Would the mean score on an easy
test exceed the median
performance?

64
Graphical
techniques
Image source: http://www.flickr.com/photos/pagedooley/2121472112/

65
Visualisation
“Visualization is any technique
for creating images, diagrams, or
animations to communicate a message.”
- Wikipedia
Image source: http://en.wikipedia.org/wiki/File:FAE_visualization.jpg

66
Science is
beautiful
(Nature Video)
(YouTube – 5:30 mins)
Image source:http://commons.wikimedia.org/wiki/File:Parodyfilm.png

67
Is Pivot a turning
point for web
exploration?
(Gary Flake)
(TED talk - 6 min.)
Image source:http://commons.wikimedia.org/wiki/File:Parodyfilm.png

68
Principles of graphing
• Clear purpose
• Maximise clarity
• Minimise clutter
• Allow visual comparison

69
Graphs
(Edward Tufte)
• Visualise data
• Reveal data
– Describe
– Explore
– Tabulate
– Decorate
• Communicate complex ideas
with clarity, precision, and
efficiency

70
Graphing steps
1. Identify purpose of the graph
(make large amounts of data coherent;
present many #s in small space;
encourage the eye to make comparisons)
2. Select type of graph to use
3. Draw and modify graph to be
clear, non-distorting, and
well-labelled (maximise clarity, minimise
clutter; show the data; avoid distortion;
reveal data at several levels/layers)

71
Software for
data visualisation (graphing)
1. Statistical packages
● e.g., SPSS Graphs or via Analyses
2. Spreadsheet packages
● e.g., MS Excel
3. Word-processors
● e.g., MS Word – Insert – Object –
Microsoft Graph Chart

72
Cleveland’s hierarchy
Image source: http://www.processtrends.com/TOC_data_visualization.htm

73
Cleveland’s hierarchy
Image source: https://priceonomics.com/how-william-cleveland-turned-data-visualization/

74
Univariate graphs
Non-parametric (i.e., nominal or ordinal):
• Bar graph
• Pie chart
Parametric (i.e., normally distributed interval or ratio):
• Histogram
• Stem & leaf plot
• Data plot / Error bar
• Box plot

75
Bar chart (Bar graph)
[Figure: two bar charts of Count by AREA (e.g., Biology, Anthropology, Psychology, Sociology); one drawn with a truncated Y-axis (9 to 13), the other with a full Y-axis (0 to 12)]
• Allows comparison of heights of bars
• X-axis: Collapse if too many categories
• Y-axis: Count/Frequency or % - truncation
exaggerates differences
• Can add data labels (data values for each bar)

76
Pie chart
[Figure: pie chart with slices for Biology, Anthropology, Psychology, and Sociology]
• Use a bar chart instead
• Hard to read
–Difficult to show
• Small values
• Small differences
–Rotation of chart and
position of slices
influence perception

77
Pie chart → Use bar chart instead
Image source: https://priceonomics.com/how-william-cleveland-turned-data-visualization/

78
Histogram
[Figure: three histograms of Participant Age (Mean = 24.0, Std. Dev = 9.16, N = 5575), each drawn with a different number of bins and a different Y-axis scale]
• For continuous data (Likert?, Ratio)
• X-axis needs a happy medium for #
of categories
• Y-axis matters (can exaggerate)
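The effect of the number of categories on the X-axis can be seen by binning the same data with different bin widths. A rough text-based sketch with hypothetical ages; the helper name `text_histogram` is illustrative only.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one text line per bin, with '#' marks showing the frequency."""
    # Map each value to the lower bound of its bin, then count per bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:<6} {'#' * counts[start]}")
    return lines


ages = [17, 18, 18, 19, 19, 19, 20, 20, 21, 22, 23, 25, 28, 33, 41, 52]  # hypothetical

for width in (2, 5, 10):
    print(f"Bin width = {width}")
    print("\n".join(text_histogram(ages, width)))
    print()
```

Very narrow bins give a spiky, gappy picture, while very wide bins hide the shape; a width somewhere in between usually shows the distribution best.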

79
Histogram of male & female heights
Wild & Seber (2000)
Image source: Wild, C. J., & Seber, G. A. F. (2000). Chance encounters: A first course in data analysis and inference. New York: Wiley.

80
Stem & leaf plot
● Use for ordinal, interval and ratio data
(if rounded)
● May look confusing to unfamiliar reader

81
Stem & leaf plot
• Contains actual data
• Collapses tails
• Underused alternative to histogram
Frequency Stem & Leaf
7.00 1 . &
192.00 1 . 22223333333
541.00 1 . 444444444444444455555555555555
610.00 1 . 6666666666666677777777777777777777
849.00 1 . 88888888888888888888888888899999999999999999999
614.00 2 . 0000000000000000111111111111111111
602.00 2 . 222222222222222233333333333333333
447.00 2 . 4444444444444455555555555
291.00 2 . 66666666677777777
240.00 2 . 88888889999999
167.00 3 . 000001111
146.00 3 . 22223333
153.00 3 . 44445555
118.00 3 . 666777
99.00 3 . 888999
106.00 4 . 000111
54.00 4 . 222
339.00 Extremes (>=43)
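A simplified stem & leaf plot can be built by splitting each score into a stem (tens) and a leaf (units). This sketch uses hypothetical ages and one row per stem; the SPSS output above additionally splits each stem across several rows and lets each leaf stand for many cases.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for whole numbers: stem = tens, leaf = units."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(str(leaf))
    return [f"{stem} . {''.join(leaves)}" for stem, leaves in sorted(rows.items())]


ages = [18, 19, 19, 20, 21, 21, 22, 24, 27, 31, 35, 42]  # hypothetical ages

for row in stem_and_leaf(ages):
    print(row)
```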

82
Box plot
(Box &
whisker)
● Useful for
interval and
ratio data
● Represents
min., max,
median,
quartiles, &
outliers

83
Box plot (Box & whisker)
• Alternative to histogram
• Useful for screening
• Useful for comparing variables
• Can get messy - too much info
• Confusing to unfamiliar reader
[Figure: box plots of Time Management-T1 and Self-Confidence-T1 by Participant Gender (Female, Male, Missing), with many labelled outlier cases]

84
Data plot & error bar
Data plot Error bar

85
Line graph
• Alternative to histogram
• Implies continuity e.g., time
• Can show multiple lines
[Figure: line graph of Mean (Y-axis, 5.0 to 8.0) for OVERALL SCALES at T0, T1, T2, and T3]

86
Graphical
integrity
(part of
academic
integrity)

87
"Like good writing, good graphical
displays of data communicate ideas
with clarity, precision, and efficiency.
Like poor writing, bad graphical
displays distort or obscure the data,
make it harder to understand or
compare, or otherwise thwart the
communicative effect which the
graph should convey."
Michael Friendly –
Gallery of Data Visualisation

88
Tufte’s graphical integrity
• Some lapses intentional, some not
• Lie Factor = (size of effect in graph) /
(size of effect in data)
• Misleading uses of area
• Misleading uses of perspective
• Leaving out important context
• Lack of taste and aesthetics
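The Lie Factor can be worked through for a bar chart with a truncated Y-axis. In this sketch, "size of effect" is taken to be the relative change between two values, and the numbers are hypothetical: counts of 10 and 12 plotted on an axis starting at 9, so the bars are drawn 1 and 3 units tall.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return (second - first) / first


def lie_factor(graph_first, graph_second, data_first, data_second):
    """Size of effect shown in the graph divided by size of effect in the data."""
    return effect_size(graph_first, graph_second) / effect_size(data_first, data_second)


# Hypothetical chart: data values 10 and 12, drawn as bars 1 and 3 units tall
# because the Y-axis starts at 9 instead of 0.
truncated = lie_factor(1, 3, 10, 12)

# The same data drawn with a Y-axis starting at 0: bar heights equal the data.
honest = lie_factor(10, 12, 10, 12)

print(f"Truncated axis: Lie Factor = {truncated:.1f}")
print(f"Axis from zero: Lie Factor = {honest:.1f}")
```

A Lie Factor near 1 means the graph represents the data faithfully; here the truncated axis makes a 20% difference look like a 200% difference.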

89
Review exercise:
Fill in the cells in this table
Level                   Properties   Examples   Descriptive Statistics   Graphs
Nominal / Categorical
Ordinal / Rank
Interval
Ratio
Answers: http://goo.gl/Ln9e1

90
References
1. Chambers, J., Cleveland, W., Kleiner, B., & Tukey, P. (1983).
Graphical methods for data analysis. Boston, MA: Duxbury
Press.
2. Cleveland, W. S. (1985). The elements of graphing data.
Monterey, CA: Wadsworth.
3. Jones, G. E. (2006). How to lie with charts. Santa Monica, CA:
LaPuerta.
4. Tufte, E. R. (1983). The visual display of quantitative information.
Cheshire, CT: Graphics Press.
5. Tufte, E. R. (2001). Visualizing quantitative data. Cheshire, CT:
Graphics Press.
6. Tukey, J. (1977). Exploratory data analysis. Addison-Wesley.
7. Wild, C. J., & Seber, G. A. F. (2000). Chance encounters: A first
course in data analysis and inference. New York: Wiley.

91
Open Office Impress
● This presentation was made using
Open Office Impress.
● Free and open source software.
● http://www.openoffice.org/product/impress.html
