Introduction to Data Analysis.
Levels of measurement and Descriptive statistics
2
What’s this course about?
 Introduction to the use of quantitative data in
social science.
The tools we need in order to use numerical data
(i.e. anything we can count) to better understand
the world.
Very basic introduction; students intending to
write theses using primarily quantitative data
should also attend the intermediate/advanced
lectures.
3
Why am I here?
 Your own research.
 Using quantitative data as an integral part of your
thesis.
 Using quantitative data as supplementary evidence.
 Making better use of qualitative data.
 Other people’s research.
 Understanding work in your area.
 Criticising work in your area.
 It’s compulsory….
4
Today’s lecture
 Different types of data.
 Descriptive statistics.
 Averages of a distribution of data.
 How to measure the ‘spread’ of a distribution.
 How (not) to lie with statistics.
 Graphs and best practice.
 Reading:
 A & F - Chapter 1 & 3
5
What is Statistics?
 Methods for:
Designing and carrying out research studies
Describing collected data
Making decisions/inferences about phenomena
represented by data
6
Some key terms (1)
 Population: the total set of individual objects
or persons of interest in a study
 Sample: a subset of the population that is
actually observed
7
Key Terms (2)
 Descriptive Stats consist of methods of
graphical and numerical techniques for
summarizing the information in a collection of
data
 Inferential stats consist of procedures for
making generalizations about characteristics of
a population, based on info from a sample.
8
Key terms (3)
 Parameters are the characteristics of the
population about which we make inferences
using sample data
 Statistics are the corresponding characteristics
of the sample data, upon which we base our
inferences about parameters.
9
Variables and their measurement
 Variable = measurement of a characteristic of
a subject (something or someone) that varies
across subjects in a population of subjects.
 Different levels of measurement, which means
that we have to examine different types of data
in different ways.
10
Nominal level measures (1)
 Just represent a category.
 e.g. Male
Female.
 e.g. Single
Married
Divorced.
 Since there is no ordering, these are nominal
measures.
 Often called qualitative, since two values
differ in quality not quantity.
11
Nominal level measures (2)
 Can quantify these data
by tabulating them.
 Normally represent
nominal data in a simple
table with percentages.
 Take the marital status
of all of my 25 friends
(i.e. the population we
are looking at is “all
Ryan’s friends”).
Marital status   Number   %
Single           18       72%
Married          6        24%
Divorced         1        4%
Total            25       100%
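The tabulation above can be reproduced in a few lines of Python. This is only a sketch: the list `statuses` is rebuilt from the counts in the table, and the variable names are chosen here for illustration.

```python
from collections import Counter

# Marital status of the 25 friends, rebuilt from the counts in the table.
statuses = ["Single"] * 18 + ["Married"] * 6 + ["Divorced"] * 1

counts = Counter(statuses)
total = sum(counts.values())

# Percentage share of each category.
percentages = {status: 100 * n / total for status, n in counts.items()}

# Print the table, most common category first.
for status, n in counts.most_common():
    print(f"{status:<10}{n:>4}{percentages[status]:>6.0f}%")
print(f"{'Total':<10}{total:>4}{100:>6}%")
```

Because the categories have no order, counting and percentages are about all that can be done with them at this stage.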
12
Ordinal level measures
 Categories again, but these categories are ordered.
 e.g. Many polling/survey questions.
 “It was right for Britain to send troops to Iraq”
 Strongly agree
 Agree
 Disagree
 Strongly disagree.
 The distance between each category is unknown.
 “Strong agreers” are more hawkish than “agreers”, but we have no idea
how much more hawkish they are.
 We can say one observation is greater in rank than another.
 Can be ranking in class (for example) or from naturally ordered categories
 Called quantitative because different values represent different
magnitudes.
13
Interval level measures
 Numbers represent a quantitative variable.
 e.g. Income, number of pupils per teacher, age, etc.
 There is a specific distance between each level.
 We can not only say that my sister is younger than I
am, but that she is 2 years younger.
 Age is a continuous variable: one can also subdivide the measure
(784 days, 3 hours and 2 minutes younger…).
 It is also true that my parents have only 2 children.
 Number of children is a discrete variable: you cannot sub-divide
children; you have 1, or 2, or 3. You can’t have 2 ½ children.
14
Descriptive statistics
 Most statistics that we will cover today apply
to variables that are interval level measures.
 Descriptive statistics are just that. They
describe a large amount of data in a summary
form.
 Why bother? Because we’re often interested in what a
typical person (or country or school or parliament etc.)
looks like.
15
Measuring the central tendency
 What we want to do is
reduce a lot of interval level
measurements to a few
numbers.
 The salaries of all of my
best friends (the population
is Ryan’s best friends).
 What is the typical annual
salary of a best friend of
mine?
Name Salary
Ellen £75,000
Jenny £13,000
Justin £31,000
Andrew £26,000
Mungo £15,000
16
The mean
 The most usual way of measuring the central
tendency is to use the mean (or average).
 This is simply the sum of the measurements
divided by the number of observations.
 For our salaried people:
$$\text{Mean} = \frac{75{,}000 + 13{,}000 + 31{,}000 + 26{,}000 + 15{,}000}{5} = 32{,}000$$
 Mean = £32,000
17
A (very) little bit of math
 To introduce some terms which will be useful later, the
mean is calculated as follows. Suppose we have n
observations, with each value denoted by $X_1$, $X_2$ and so
on until $X_n$. Then the mean is described as follows:
$$\bar{X} = \frac{1}{n}(X_1 + X_2 + \dots + X_n)$$
Or, to put it another way:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
18
The mean’s properties
 Shift of origin of measurement.
 If everyone earns £2000 more, then the new mean salary
is just the old mean salary (£32,000) PLUS £2000.
 Change of scale.
 If we calculate salary in dollars (say £1 = $2), then the
new mean salary is simply twice the old mean salary.
 Sum of two variables.
 Imagine that income = salary + savings interest.
 Mean income = mean salary + mean savings interest.
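The three properties can be checked numerically. The sketch below uses the five salaries from the earlier slide; the `interest` figures are invented purely for illustration and do not come from the lecture.

```python
from statistics import mean

# Salaries of the five best friends, in pounds.
salaries = [75_000, 13_000, 31_000, 26_000, 15_000]
# Hypothetical savings interest per friend (invented for illustration).
interest = [500, 100, 300, 200, 100]

base_mean = mean(salaries)

# Shift of origin: everyone earns 2,000 more.
shifted_mean = mean([s + 2_000 for s in salaries])

# Change of scale: pounds to dollars at the slide's rate of £1 = $2.
dollar_mean = mean([2 * s for s in salaries])

# Sum of two variables: income = salary + savings interest.
income_mean = mean([s + i for s, i in zip(salaries, interest)])

print(base_mean, shifted_mean, dollar_mean, income_mean)
```

In each case the new mean can be obtained from the old one without going back to the individual observations.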
19
The median
 Another common way of deriving one number to
describe many is to use the median.
 Imagine we ranked all observations, the median
is simply the observation in the middle (½ of
observations above and ½ below).
 In ascending order the salaries are:
13,000; 15,000; 26,000; 31,000; 75,000.
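 Median = £26,000.
 With an even number of observations, the median is the mean of the two middle values. For example, if 26,000 and 31,000 were the middle pair: Median = ½(26,000 + 31,000) = 28,500.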
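As a small illustration, Python's standard `statistics.median` handles both the odd and the even case. The four-value list is a hypothetical variant (the same salaries without the lowest one), used here only to show how an even count is treated.

```python
from statistics import median

# Salaries of the five best friends, in pounds.
salaries = [75_000, 13_000, 31_000, 26_000, 15_000]

# Odd number of observations: the middle value after sorting.
odd_median = median(salaries)

# Hypothetical even-sized case: the same data without the lowest salary.
four_salaries = [75_000, 31_000, 26_000, 15_000]

# Even number of observations: the mean of the two middle values.
even_median = median(four_salaries)

print(sorted(salaries), odd_median)
print(sorted(four_salaries), even_median)
```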
20
The median’s properties (1)
 Shift of origin of measurement. YES
 Change of scale. YES
 Sum of two variables. NO
 The lack of this property is somewhat important (which
will become apparent in the following weeks), and is
related to one of the reasons why we generally use the
mean in most statistical analysis.
 Nonetheless, the median does have some advantages
over the mean in describing some types of data.
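A toy example makes the YES/YES/NO pattern concrete. The numbers in `a` and `b` are invented for illustration; any pair of variables whose orderings differ would do.

```python
from statistics import mean, median

# Toy data, invented for illustration.
a = [1, 2, 9]
b = [9, 2, 1]
total = [x + y for x, y in zip(a, b)]  # [10, 4, 10]

# Shift of origin and change of scale carry over to the median.
assert median([x + 5 for x in a]) == median(a) + 5
assert median([2 * x for x in a]) == 2 * median(a)

# The mean of a sum is the sum of the means...
print(mean(total), mean(a) + mean(b))
# ...but the median of a sum is not, in general, the sum of the medians.
print(median(total), median(a) + median(b))
```

Here both means come to 8, while the median of the sum is 10 and the sum of the medians is 4.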
21
The median’s properties (2)
 For our salary example, the mean of my best friends’
salaries gives a substantially higher value than the
median (£6000 more).
 This is due to the distribution of the observations. For
the mean and median to be the same the distribution
of observations needs to be symmetrical.
 Imagine we now look at all my friends and
acquaintances (the population of 25 people as before),
and plot the frequency of each salary for all 25.
22
[Figure: Frequency graph of salaries. x-axis: Salary, in bands from "10,000 and under" to "70,001 - 80,000"; y-axis: Frequency, 0 to 9. Marked on the graph: Median = 26,000, Mean = 34,000.]
23
Positions of the median and mean
 For distributions with a long tail to the right, the
mean will take a higher value than the median.
 This is generally true across the world for income
distributions, and is captured by Pen’s “parade of
dwarfs and a few giants”.
 If such a parade were organised today, then the person of mean
height (and income) would be taller (and richer) than 65% of the
population and so would pass by after 40 minutes had elapsed.
 Mean income is ~£24,000, median income is ~£16,000.
 For data with ‘outliers’ the median can give a better
idea of what the “typical” observation is like.
24
Ordinal level data
 The median can be used for ordinal level data.
 Imagine we had asked my 5 best friends about their
position on the Iraq war; 2 strongly agreed with
sending British troops, one agreed, one disagreed and
one strongly disagreed.
 We can rank these answers and then find the median.
Strongly agree; strongly agree; agree; disagree; strongly disagree.
 Thus the median answer is agree.
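The ranking step can be sketched in Python by sorting the answers according to their position on the ordered scale. The list names are chosen here for illustration. Note that this only works cleanly for an odd number of answers: with an even number, the two middle answers may differ and ordinal categories cannot be averaged.

```python
# Ordered response scale, from most to least agreement.
scale = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]

# Answers of the five best friends, in no particular order.
answers = [
    "Agree",
    "Strongly agree",
    "Strongly disagree",
    "Strongly agree",
    "Disagree",
]

# Rank the answers by their position on the scale.
ranked = sorted(answers, key=scale.index)

# With an odd number of answers, the median is the middle one.
median_answer = ranked[len(ranked) // 2]

print(ranked)
print(median_answer)
```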
25
Nominal level data
 In general, we can’t use the median or mean
for nominal data.
Normally use the mode. This is the most
commonly occurring value.
 e.g. if 53 people here are politics students, 40 sociology
students, and 46 are other subjects, then the modal value
is politics.
There is one special case in which we can use the
mean for nominal data however…
26
Nominal binary data
 …binary data is an exception as we can use the
mean. Binary data (e.g. Yes/No, Male/Female) can
be coded as 0 or 1.
 A variable measuring sex, men are coded 1 and women coded 0.
 The mean score for those 0s and 1s is the proportion of men.
There were 2 women and 3 men amongst my best friends.
$$\text{Mean} = \frac{1 + 1 + 1 + 0 + 0}{5} = 0.6 = 60\%$$
 The median does NOT make sense for binary data. It just tells us
what the majority of the population is.
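The same calculation as a minimal Python sketch, using the coding from this slide (men = 1, women = 0) for the three men and two women among the best friends:

```python
# Sex coded as on the slide: men = 1, women = 0 (3 men, 2 women).
sex = [1, 1, 1, 0, 0]

# The mean of a 0/1 variable is the proportion of 1s.
proportion_men = sum(sex) / len(sex)

print(f"{proportion_men:.0%}")
```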
27
Exercise
 Population is all countries
with nuclear capability, and
variable is approximate
number of nuclear weapons.
 What’s the mean, mode and
median for no. of nukes?
 How good is each of these
at summarizing the data? Do
we need more information
than just a measure of
central tendency?
Country No. of nukes
USA 10,000
India 75
China 400
France 400
Britain 200
Russia 12,800
Israel 100
Pakistan 25
28
Some answers
 Mean = 24,000/8 = 3,000
 Median = (400+200)/2 = 300
 Mode = 400
 These summary measures are useful, but we
also need to know something about the
distribution, because two countries account for
virtually all the nuclear weapons in the world.
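These answers can be checked with Python's standard `statistics` module. The dictionary simply restates the approximate figures from the exercise table; `top_two_share` is an extra quantity computed here to put a number on the final point about concentration.

```python
from statistics import mean, median, mode

# Approximate number of nuclear weapons, as given in the exercise table.
nukes = {
    "USA": 10_000,
    "India": 75,
    "China": 400,
    "France": 400,
    "Britain": 200,
    "Russia": 12_800,
    "Israel": 100,
    "Pakistan": 25,
}
values = list(nukes.values())

mean_nukes = mean(values)
median_nukes = median(values)
mode_nukes = mode(values)

# Share of all weapons held by the two largest holders.
top_two_share = (nukes["USA"] + nukes["Russia"]) / sum(values)

print(mean_nukes, median_nukes, mode_nukes, f"{top_two_share:.0%}")
```

The mean is ten times the median, which is already a hint that the distribution is far from symmetrical.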
29
Measures of dispersion
 The mean (or median) tells us something about the centre of
the distribution, but what about its dispersion?
 The means/medians of the below distributions of children’s
scores on a maths test in three different classes are all the same
(48 observations, mean of 7, median of 7), but each tells a
quite different story.
[Figure: three frequency graphs of maths test scores, one per class. x-axis: score, 1 to 13; y-axis: frequency, 0 to 25.]
30
The range
 The range is simply a measure of the distance
between the largest and smallest observations.
 The range for our salary example is therefore:
75,000 – 13,000 = 62,000.
 Clearly this is not ideal as it relies on only two
observations.
 Say we have 1000 poker players. 999 win nothing, and
1 wins £1million. The range indicates lots of variation,
when most people are actually identical.
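Both examples are easy to reproduce. In this sketch the poker winnings are built directly from the description above (999 players on zero, one on £1 million), and `share_identical` is an illustrative name for the fraction of players with exactly the same outcome.

```python
# Salaries of the five best friends, in pounds.
salaries = [75_000, 13_000, 31_000, 26_000, 15_000]
salary_range = max(salaries) - min(salaries)

# 1000 poker players: 999 win nothing, 1 wins £1 million.
winnings = [0] * 999 + [1_000_000]
poker_range = max(winnings) - min(winnings)

# Fraction of players whose winnings are identical (zero).
share_identical = winnings.count(0) / len(winnings)

print(salary_range, poker_range, share_identical)
```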
31
The variance and standard deviation
 A better way of assessing how much values of
a variable vary around the mean is to use the
standard deviation or variance.
 Basic idea is to measure how different
individual values are from the mean value.
 Some of these deviations from the mean will
be positive and some negative, so we square
each deviation.
32
The variance
 Take my 5 best friends. The mean salary was £32,000.
 If we added up all the differences then we would get zero, so
we need to square the differences (i.e. multiply them by
themselves).
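[Figure: the five salaries placed along a line around Mean = £32,000, labelled Ellen (75,000), Jenny, Mungo (15,000), Justin and Andrew.]
Difference for Ellen = 75,000 - 32,000 = 43,000
Difference for Mungo = 15,000 - 32,000 = -17,000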
33
Calculating variance
 Salary example, with 5 obs, and mean of 32,000.
Salary (000s)   Deviation from mean   Squared deviation
75              75 - 32 = 43          43 * 43 = 1849
13              13 - 32 = -19         -19 * -19 = 361
31              31 - 32 = -1          -1 * -1 = 1
26              26 - 32 = -6          -6 * -6 = 36
15              15 - 32 = -17         -17 * -17 = 289

$$\text{Total squared deviations} = 1849 + 361 + 1 + 36 + 289 = 2536$$
$$\text{Variance} = \frac{\text{Total squared deviations}}{n} = \frac{2536}{5} = 507.2$$
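The calculation can be followed step by step in Python. This sketch works in thousands of pounds, as the table does, and divides by n because the five friends are treated as the whole population.

```python
# Salaries of the five best friends, in thousands of pounds.
salaries_k = [75, 13, 31, 26, 15]
n = len(salaries_k)

mean_k = sum(salaries_k) / n

# Deviation of each salary from the mean; these always sum to zero.
deviations = [x - mean_k for x in salaries_k]

# Squaring removes the signs so the deviations no longer cancel out.
squared = [d ** 2 for d in deviations]

# Population variance: total squared deviations divided by n.
variance = sum(squared) / n

print(deviations)
print(squared)
print(variance)
```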
34
Calculating standard deviation
 The standard deviation is the most common way to
measure deviation from the mean and is simply the
square root of the variance.
 We normally call the variance $s^2$ and the standard deviation $s$. Thus
for our example, $s^2$ = 507.2, and $s$ = 22.5.
$$s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}}$$
**we usually use n-1 in the denominator
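Python's standard `statistics` module offers both versions: `pstdev` divides by n (the formula on this slide) and `stdev` divides by n-1. A short sketch on the salary data, in thousands of pounds, shows how much the choice matters with only five observations:

```python
from statistics import pstdev, stdev

# Salaries of the five best friends, in thousands of pounds.
salaries_k = [75, 13, 31, 26, 15]

# Divides the total squared deviations by n.
s_population = pstdev(salaries_k)

# Divides the total squared deviations by n - 1.
s_sample = stdev(salaries_k)

print(round(s_population, 1), round(s_sample, 1))
```

With n-1 in the denominator the result is always a little larger; the gap shrinks as the number of observations grows.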
35
Examples of standard deviation
[Figure: three frequency graphs of test scores, one per class. x-axis: score, 1 to 13; y-axis: frequency, 0 to 25.]
s = 1.02: Tight distribution (All children perform similarly)
s = 1.67: Clustered distribution (Most children perform to a similar level, with some variation)
s = 4.01: Dispersed distribution (One group of geniuses, one group of idiots)
36
But what does it mean…?
 Our salary example had a standard deviation
of 22.5, but for the distributions above s
varied between 1 and 4. What does this tell us?
 Best way to think of it is as a kind of rough
average distance of an observation to the
mean.
 Thus the standard deviation depends on the
units we are measuring in.
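The dependence on units is easy to see by re-expressing the same salaries. This sketch reuses the illustrative exchange rate of £1 = $2 from the earlier slide on the mean's properties; the standard deviation is rescaled by exactly the same factor as the data.

```python
from statistics import pstdev

# The same five salaries in three different units.
salaries_k = [75, 13, 31, 26, 15]                      # thousands of pounds
salaries_pounds = [1_000 * x for x in salaries_k]      # pounds
salaries_dollars = [2 * x for x in salaries_pounds]    # dollars at £1 = $2

s_k = pstdev(salaries_k)
s_pounds = pstdev(salaries_pounds)
s_dollars = pstdev(salaries_dollars)

print(round(s_k, 1), round(s_pounds), round(s_dollars))
```

So a standard deviation of 22.5 and one of 4 cannot be compared directly unless the two variables are measured in the same units.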
37
Standard deviation summary
 Broadly speaking, high levels of s indicate
greater variation, and the value of s gives a
broad idea of a typical distance from the mean.
 The concept of standard deviation is an
important one, and next week I’ll talk more
about particular types of distributions and their
properties.
38
How to (not) lie with statistics
 Even simple descriptive statistics can be
misused in order to mislead.
 Particularly the case for simple graphs.
 Most examples I will use here are from Edward Tufte,
The Visual Display of Quantitative Information (1983,
and later reprints).
 However, see any copy of any of the Sunday papers for similar
glaring errors.
39
Too little information
 Presenting too little summary information.
 Example courtesy of Tukey (1979) in JASA.
 Take Washoe County in Nevada, USA. There is a mean
population density of 13 ½ people per square mile.
 The mean is not informative without information on the
distribution however, for in fact 80% of the inhabitants
live in two cities.
 The cities have population densities of ~5000 per square mile.
 The rest of the county has a population density of 2 ½ people
per square mile.
40
Base years (1)
 Picking your base year (Tufte 1983).
41
Base years (2)
42
Measures over time (1)
43
Measures over time (2)
44
The lie factor
 Are doctors really becoming smaller?
45
Small differences
 Just because something is top or bottom of a
list, that doesn’t in itself imply anything.
 The difference between top and bottom might
be very small.
 Close to home, look at the Norrington table for this. The
difference between the middle 10 colleges is essentially
zero, but it’s the ranking that everyone cares about.
 Ranking of countries by something like literacy rates is
often similarly futile. There has to be one at the top with
99.9% but all Western countries will have 99%+ rates…
46
(very) Small samples
 9 out of 10 cats prefer ‘Whiskers’…
 We may think that the evidence for this is
strong if thousands of cats had their opinion
solicited, but maybe weak if only 10 cats were
‘questioned’ out of the population of millions.
 Knowing when a small sample is too small is one of the
topics we will cover over the next two weeks and is a
critical part of understanding commonly used statistics.
47
“How to talk back to a statistic”
 Who says so?
 We all want to prove our own theories correct…
 How does he know?
 Is the data reputable?
 What’s missing?
 e.g. means are no use without standard deviations.
 Does it make sense?
 Social science is the science of the bloody obvious most of the
time. Don’t let numbers confuse or fool you; if it sounds wrong, it
probably is.
48
Next week
 Go back to ideas of distributions of data, and
commonly found distributions.
 Also look at sampling and surveys.
 Been discussing data where we have information on the
entire population (all my friends; all doctors in the US,
etc.).
 We might more normally have only a sample of
observations though.
 How accurate are samples in describing populations?
