Collecting, Managing, and Assessing Data Using Sample Surveys
Collecting, Managing, and Assessing Data Using Sample Surveys
provides a thorough, step-by-step guide to the design and
implementation of surveys. Beginning with a primer on basic statistics,
the first half of the book takes readers on a comprehensive tour
through the basics of survey design. Topics covered include the
ethics of surveys, the design of survey procedures, the design of
the survey instrument, how to write questions, and how to draw
representative samples. Having shown readers how to design
surveys, the second half of the book discusses a number of issues
surrounding their implementation, including repetitive surveys, the
economics of surveys, Web-based surveys, coding and data entry,
data expansion and weighting, the issue of nonresponse, and the
documenting and archiving of survey data. The book is an
excellent introduction to the use of surveys for graduate students as well
as a useful reference work for scholars and professionals.
Peter Stopher is Professor of Transport Planning at the
Institute of Transport and Logistics Studies at the University of
Sydney. He has also been a professor at Northwestern University,
Cornell University, McMaster University, and Louisiana State
University. Professor Stopher has developed a substantial
reputation in the field of data collection, particularly for the support of
travel forecasting and analysis. He pioneered the development of
travel and activity diaries as a data collection mechanism, and has
written extensively on issues of sample design, data expansion,
nonresponse biases, and measurement issues.
To my wife, Carmen, with grateful thanks for your faith in me and your continuing
support and encouragement.
Contents
List of figures page xix
List of tables xxii
Acknowledgements xxv
1 Introduction 1
1.1 The purpose of this book 1
1.2 Scope of the book 2
1.3 Survey statistics 4
2 Basic statistics and probability 6
2.1 Some definitions in statistics 6
2.1.1 Censuses and surveys 7
2.2 Describing data 8
2.2.1 Types of scales 8
Nominal scales 8
Ordinal scales 9
Interval scales 9
Ratio scales 10
Measurement scales 10
2.2.2 Data presentation: graphics 11
2.2.3 Data presentation: non-graphical 16
Measures of magnitude 17
Frequencies and proportions 17
Central measures of data 21
Measures of dispersion 34
The normal distribution 45
Some useful properties of variances and standard deviations 46
Proportions or probabilities 47
Data transformations 48
Covariance and correlation 50
Coefficient of variation 51
Other measures of variability 53
Alternatives to Sturges’ rule 62
3 Basic issues in surveys 64
3.1 Need for survey methods 64
3.1.1 A definition of sampling methodology 65
3.2 Surveys and censuses 65
3.2.1 Costs 66
3.2.2 Time 67
3.3 Representativeness 68
3.3.1 Randomness 69
3.3.2 Probability sampling 70
Sources of random numbers 71
3.4 Errors and bias 71
3.4.1 Sample design and sampling error 73
3.4.2 Bias 74
3.4.3 Avoiding bias 78
3.5 Some important definitions 78
4 Ethics of surveys of human populations 81
4.1 Why ethics? 81
4.2 Codes of ethics or practice 82
4.3 Potential threats to confidentiality 84
4.3.1 Retaining detail and confidentiality 85
4.4 Informed consent 86
4.5 Conclusions 89
5 Designing a survey 91
5.1 Components of survey design 91
5.2 Defining the survey purpose 93
5.2.1 Components of survey purpose 94
Data needs 94
Comparability or innovation 97
Defining data needs 99
Data needs in human subject surveys 99
Survey timing 100
Geographic bounds for the survey 101
5.3 Trade-offs in survey design 102
6 Methods for conducting surveys of human populations 104
6.1 Overview 104
6.2 Face-to-face interviews 105
6.3 Postal surveys 107
6.4 Telephone surveys 108
6.5 Internet surveys 111
6.6 Compound survey methods 112
6.6.1 Pre-recruitment contact 112
6.6.2 Recruitment 113
Random digit dialling 115
6.6.3 Survey delivery 117
6.6.4 Data collection 118
6.6.5 An example 119
6.7 Mixed-mode surveys 120
6.7.1 Increasing response and reducing bias 123
6.8 Observational surveys 125
7 Focus groups 127
7.1 Introduction 127
7.2 Definition of a focus group 128
7.2.1 The size and number of focus groups 128
7.2.2 How a focus group functions 129
7.2.3 Analysing the focus group discussions 131
7.2.4 Some disadvantages of focus groups 131
7.3 Using focus groups to design a survey 132
7.4 Using focus groups to evaluate a survey 134
7.5 Summary 135
8 Design of survey instruments 137
8.1 Scope of this chapter 137
8.2 Question type 137
8.2.1 Classification and behaviour questions 138
Mitigating threatening questions 139
8.2.2 Memory or recall error 142
8.3 Question format 145
8.3.1 Open questions 145
8.3.2 Field-coded questions 146
8.3.3 Closed questions 147
8.4 Physical layout of the survey instrument 150
8.4.1 Introduction 150
8.4.2 Question ordering 153
Opening questions 153
Body of the survey 154
The end of the questionnaire 158
8.4.3 Some general issues on question layout 159
Overall format 160
Appearance of the survey 161
Front cover 162
Spatial layout 163
Choice of typeface 164
Use of colour and graphics 166
Question numbering 169
Page breaks 170
Repeated questions 171
Instructions 172
Show cards 174
Time of the interview 174
Precoding 174
End of the survey 175
Some final comments on questionnaire layout 176
9 Design of questions and question wording 177
9.1 Introduction 177
9.2 Issues in writing questions 178
9.2.1 Requiring an answer 178
9.2.2 Ready answers 180
9.2.3 Accurate recall and reporting 181
9.2.4 Revealing the data 182
9.2.5 Motivation to answer 183
9.2.6 Influences on response categories 184
9.2.7 Use of categories and other responses 185
Ordered and unordered categories 187
9.3 Principles for writing questions 188
9.3.1 Use simple language 189
9.3.2 Number of words 190
9.3.3 Avoid using vague words 191
9.3.4 Avoid using ‘Tick all that apply’ formats 193
9.3.5 Develop response categories that are mutually exclusive
and exhaustive 193
9.3.6 Make sure that questions are technically correct 195
9.3.7 Do not ask respondents to say ‘Yes’ in order to say ‘No’ 196
9.3.8 Avoid double-barrelled questions 196
9.4 Conclusion 197
10 Special issues for qualitative and preference surveys 199
10.1 Introduction 199
10.2 Designing qualitative questions 199
10.2.1 Scaling questions 200
10.3 Stated response questions 206
10.3.1 The hypothetical situation 206
10.3.2 Determining attribute levels 207
10.3.3 Number of choice alternatives or scenarios 207
10.3.4 Other issues of concern 208
Data inconsistency 208
Lexicographic responses 209
Random responses 209
10.4 Some concluding comments on stated response survey design 210
11 Design of data collection procedures 211
11.1 Introduction 211
11.2 Contacting respondents 211
11.2.1 Pre-notification contacts 211
11.2.2 Number and type of contacts 213
Nature of reminder contacts 213
Postal surveys 215
Postal surveys with telephone recruitment 216
Telephone interviews 217
Face-to-face interviews 219
Internet surveys 220
11.3 Who should respond to the survey? 221
11.3.1 Targeted person 221
11.3.2 Full household surveys 223
Proxy reporting 224
11.4 Defining a complete response 225
11.4.1 Completeness of the data items 226
11.4.2 Completeness of aggregate sampling units 228
11.5 Sample replacement 229
11.5.1 When to replace a sample unit 229
11.5.2 How to replace a sample 233
11.6 Incentives 235
11.6.1 Recommendations on incentives 236
11.7 Respondent burden 240
11.7.1 Past experience 241
11.7.2 Appropriate moment 242
11.7.3 Perceived relevance 242
11.7.4 Difficulty 243
Physical difficulty 243
Intellectual difficulty 244
Emotional difficulty 245
Reducing difficulty 246
11.7.5 External factors 246
Attitudes and opinions of others 246
The ‘feel good’ effect 247
Appropriateness of the medium 248
11.7.6 Mitigating respondent burden 248
11.8 Concluding comments 250
12 Pilot surveys and pretests 251
12.1 Introduction 251
12.2 Definitions 252
12.3 Selecting respondents for pretests and pilot surveys 255
12.3.1 Selecting respondents 255
12.3.2 Sample size 258
Pilot surveys 258
Pretests 261
12.4 Costs and time requirements of pretests and pilot surveys 262
12.5 Concluding comments 264
13 Sample design and sampling 265
13.1 Introduction 265
13.2 Sampling frames 266
13.3 Random sampling procedures 268
13.3.1 Initial considerations 268
13.3.2 The normal law of error 269
13.4 Random sampling methods 270
13.4.1 Simple random sampling 271
Drawing the sample 271
Estimating population statistics and sampling errors 273
Example 276
Sampling from a finite population 279
Sampling error of ratios and proportions 279
Defining the sample size 281
Examples 283
13.4.2 Stratified sampling 285
Types of stratified samples 285
Study domains and strata 287
Weighted means and variances 287
Stratified sampling with a uniform sampling fraction 289
Drawing the sample 289
Estimating population statistics and sampling errors 290
Pre- and post-stratification 291
Example 293
Equal allocation 294
Summary of proportionate sampling 295
Stratified sampling with variable sampling fraction 295
Drawing the sample 295
Estimating population statistics and sampling errors 296
Non-coincident study domains and strata 296
Optimum allocation and economic design 297
Example 298
Survey costs differing by stratum 300
Example 301
Practical issues in drawing disproportionate samples 303
Concluding comments on disproportionate sampling 305
13.4.3 Multistage sampling 305
Drawing a multistage sample 306
Requirements for multistage sampling 307
Estimating population values and sampling statistics 308
Example 309
Concluding comments on multistage sampling 314
13.5 Quasi-random sampling methods 314
13.5.1 Cluster sampling 316
Equal clusters: population values and standard errors 317
Example 319
The effects of clustering 321
Unequal clusters: population values and standard errors 322
Random selection of unequal clusters 324
Example 325
Stratified sampling of unequal clusters 326
Paired selection of unequal-sized clusters 327
13.5.2 Systematic sampling 328
Population values and standard errors in a systematic
sample 328
Simple random model 329
Stratified random model 329
Paired selection model 329
Successive difference model 330
Example 330
13.5.3 Choice-based sampling 333
13.6 Non-random sampling methods 334
13.6.1 Quota sampling 334
13.6.2 Intentional, judgemental, or expert samples 335
13.6.3 Haphazard samples 335
13.6.4 Convenience samples 336
13.7 Summary 336
14 Repetitive surveys 337
14.1 Introduction 337
14.2 Non-overlapping samples 338
14.3 Incomplete overlap 339
14.4 Subsampling on the second and subsequent occasions 341
14.5 Complete overlap: a panel 342
14.6 Practical issues in designing and conducting panel surveys 343
14.6.1 Attrition 344
Replacement of panel members lost by attrition 345
Reducing losses due to attrition 346
14.6.2 Contamination 347
14.6.3 Conditioning 348
14.7 Advantages and disadvantages of panels 348
14.8 Methods for administering practical panel surveys 349
14.9 Continuous surveys 352
15 Survey economics 356
15.1 Introduction 356
15.2 Cost elements in survey design 357
15.3 Trade-offs in survey design 359
15.3.1 Postal surveys 360
15.3.2 Telephone recruitment with a postal survey with or
without telephone retrieval 361
15.3.3 Face-to-face interview 362
15.3.4 More on potential trade-offs 362
15.4 Concluding comments 363
16 Survey implementation 365
16.1 Introduction 365
16.2 Interviewer selection and training 365
16.2.1 Interviewer selection 365
16.2.2 Interviewer training 368
16.2.3 Interviewer monitoring 369
16.3 Record keeping 370
16.4 Survey supervision 372
16.5 Survey publicity 373
16.5.1 Frequently asked questions, fact sheet, or brochure 374
16.6 Storage of survey forms 374
16.6.1 Identification numbers 375
16.7 Issues for surveys using posted materials 377
16.8 Issues for surveys using telephone contact 377
16.8.1 Caller ID 378
16.8.2 Answering machines 378
16.8.3 Repeated requests for callback 380
16.9 Data on incomplete responses 381
16.10 Checking survey responses 382
16.11 Times to avoid data collection 383
16.12 Summary comments on survey implementation 383
17 Web-based surveys 385
17.1 Introduction 385
17.2 The internet as an optional response mechanism 388
17.3 Some design issues for Web surveys 389
17.3.1 Differences between paper and internet surveys 389
17.3.2 Question and response 390
17.3.3 Ability to fill in the Web survey in multiple sittings 392
17.3.4 Progress tracking 393
17.3.5 Pre-filled responses 394
17.3.6 Confidentiality in Web-based surveys 395
17.3.7 Pictures, maps, etc. on Web surveys 395
Animation in survey pictures and maps 396
17.3.8 Browser software 396
User interface design 396
Creating mock-ups 397
Page loading time 398
17.4 Some design principles for Web surveys 398
17.5 Concluding comments 399
18 Coding and data entry 401
18.1 Introduction 401
18.2 Coding 402
18.2.1 Coding of missing values 402
18.2.2 Use of zeros and blanks in coding 403
18.2.3 Coding consistency 404
Binary variables 404
Numeric variables 404
18.2.4 Coding complex variables 405
18.2.5 Geocoding 406
Requesting address details for other places than home 408
Pre-coding of buildings 409
Interactive gazetteers 410
Other forms of geocoding assistance 410
Locating by mapping software 411
18.2.6 Methods for creating codes 412
18.3 Data entry 413
18.4 Data repair 416
19 Data expansion and weighting 418
19.1 Introduction 418
19.2 Data expansion 419
19.2.1 Simple random sampling 419
19.2.2 Stratified sampling 419
19.2.3 Multistage sampling 420
19.2.4 Cluster samples 420
19.2.5 Other sampling methods 421
19.3 Data weighting 421
19.3.1 Weighting with unknown population totals 422
An example 423
A second example 424
19.3.2 Weighting with known populations 426
An example 427
19.4 Summary 429
20 Nonresponse 431
20.1 Introduction 431
20.2 Unit nonresponse 432
20.2.1 Calculating response rates 432
Classifying responses to a survey 433
Calculating response rates 435
20.2.2 Reducing nonresponse and increasing response rates 440
Design issues affecting nonresponse 440
Survey publicity 442
Use of incentives 442
Use of reminders and repeat contacts 443
Personalisation 444
Summary 445
20.2.3 Nonresponse surveys 445
20.3 Item nonresponse 450
20.3.1 Data repair 450
Flagging repaired variables 451
Inference 452
Imputation 452
Historical imputation 453
Average imputation 454
Ratio imputation 454
Regression imputation 455
Cold-deck imputation 456
Hot-deck imputation 457
Expectation maximisation 457
Multiple imputation 458
Imputation using neural networks 458
Summary of imputation methods 460
20.3.2 A final note on item nonresponse 460
Strategies to obtain age and income 461
Age 461
Income 462
21 Measuring data quality 464
21.1 Introduction 464
21.2 General measures of data quality 464
21.2.1 Missing value statistic 465
21.2.2 Data cleaning statistic 466
21.2.3 Coverage error 467
21.2.4 Sample bias 468
21.3 Specific measures of data quality 469
21.3.1 Non-mobility rates 469
21.3.2 Trip rates and activity rates 470
21.3.3 Proxy reporting 471
21.4 Validation surveys 472
21.4.1 Follow-up questions 473
21.4.2 Independent measurement 475
21.5 Adherence to quality measures and guidance 476
22 Future directions in survey procedures 478
22.1 Dangers of forecasting new directions 478
22.2 Some current issues 478
22.2.1 Reliance on telephones 478
Threats to the use of telephone surveys 479
Conclusions on reliance on telephones 481
22.2.2 Language and literacy 481
Language 481
Literacy 483
22.2.3 Mixed-mode surveys 486
22.2.4 Use of administrative data 487
22.2.5 Proxy reporting 488
22.3 Some possible future directions 489
22.3.1 A GPS survey as a potential substitute for a household
travel survey 493
The effect of multiple observations of each respondent
on sample size 495
23 Documenting and archiving 499
23.1 Introduction 499
23.2 Documentation or the creation of metadata 499
23.2.1 Descriptive metadata 500
23.2.2 Preservation metadata 503
23.2.3 Geospatial metadata 503
23.3 Archiving of data 506
References 511
Index 525
Figures
2.1 Scatter plot of odometer reading versus model year page 12
2.2 Scatter plot of fuel type by body type 12
2.3 Pie chart of vehicle body types 13
2.4 Pie chart of household income groups 13
2.5 Histogram of household income 14
2.6 Histogram of vehicle types 14
2.7 Line graph of maximum and minimum temperatures for thirty days 15
2.8 Ogive of cumulative household income data from Figure 2.5 16
2.9 Relative ogive of household income 16
2.10 Relative step chart of household income 17
2.11 Stem and leaf display of income 22
2.12 Arithmetic mean as centre of gravity 24
2.13 Bimodal distribution of temperatures 25
2.14 Distribution of maximum temperatures from Table 2.4 29
2.15 Distribution of minimum temperatures from Table 2.4 30
2.16 Income distribution from Table 2.5 30
2.17 Distribution of vehicle counts 33
2.18 Box and whisker plot of income data from Table 2.5 36
2.19 Box and whisker plot of maximum temperatures 37
2.20 Box and whisker plot of minimum temperatures 37
2.21 Box and whisker plot of vehicles passing through the green phase 43
2.22 Box and whisker plot of children’s ages 45
2.23 The normal distribution 45
2.24 Comparison of normal distributions with different variances 46
2.25 Scatter plot of maximum versus minimum temperature 52
2.26 A distribution skewed to the right 54
2.27 A distribution skewed to the left 54
2.28 Distribution with low kurtosis 55
2.29 Distribution with high kurtosis 55
3.1 Extract of random numbers from the RAND Million Random Digits 72
4.1 Example of a consent form 87
4.2 First page of an example subject information sheet 88
4.3 Second page of the example subject information sheet 89
5.1 Schematic of the survey process 92
5.2 Survey design trade-offs 103
6.1 Schematic of survey methods 113
8.1 Document file layout for booklet printing 162
8.2 Example of an unacceptable questionnaire format 164
8.3 Example of an acceptable questionnaire format 165
8.4 Excerpt from a survey showing arrows to guide respondent 168
8.5 Extract from a questionnaire showing use of graphics 169
8.6 Columned layout for asking identical questions about multiple people 171
8.7 Inefficient and efficient structures for organising serial questions 172
8.8 Instructions placed at the point to which they refer 173
8.9 Example of an unacceptable questionnaire format with response codes 175
9.1 Example of a sequence of questions that do not require answers 178
9.2 Example of a sequence of questions that do require answers 179
9.3 Example of a belief question 181
9.4 Example of a belief question with a more vague response 181
9.5 Two alternative response category sets for the age question 185
9.6 Alternative questions on age 186
9.7 Examples of questions with unordered response categories 187
9.8 An example of mixed ordered and unordered categories 188
9.9 Reformulated question from Figure 9.8 189
9.10 An unordered alternative to the question in Figure 9.8 189
9.11 Avoiding vague words in question wording 192
9.12 Example of a failure to achieve mutual exclusivity and exhaustiveness 194
9.13 Correction to mutual exclusivity and exhaustiveness 195
9.14 Example of a double negative 196
9.15 Example of removal of a double negative 196
9.16 An alternative that keeps the wording of the measure 197
9.17 An alternative way to deal with a double-barrelled question 197
10.1 Example of a qualitative question 200
10.2 Example of a qualitative question using number categories 200
10.3 Example of unbalanced positive and negative categories 201
10.4 Example of balanced positive and negative categories 201
10.5 Example of placing the neutral option at the end 202
10.6 Example of distinguishing the neutral option from ‘No opinion’ 202
10.7 Use of columned layout for repeated category responses 203
10.8 Alternative layout for repeated category responses 204
10.9 Statements that call for similar responses 204
10.10 Statements that call for varying responses 205
10.11 Rephrasing questions to remove requirement for ‘Agree’/‘Disagree’ 206
11.1 Example of a postcard reminder for the first reminder 215
11.2 Framework for understanding respondent burden 241
14.1 Schematic of the four types of repetitive samples 338
14.2 Rotating panel showing recruitment, attrition, and rotation 353
18.1 An unordered set of responses requiring coding 402
18.2 A possible format for asking for an address 409
18.3 Excerpt from a mark-sensing survey 415
20.1 Illustration of the categorisation of response outcomes 436
20.2 Representation of a neural network model 459
23.1 Open archival information system model 508
Tables
2.1 Frequencies and proportions of vehicle types page 18
2.2 Frequencies, proportions, and cumulative values for household
income 19
2.3 Minimum and maximum temperatures for a month (°C) 20
2.4 Grouped temperature data 21
2.5 Disaggregate household income data 22
2.6 Growth rates of an investment fund, 1993–2004 26
2.7 Speeds by kilometre for a train 27
2.8 Measurements of ball bearings 29
2.9 Number of vehicles passing through the green phase of a traffic light 32
2.10 Sorted number of vehicles passing through the green phase 32
2.11 Number of children by age 34
2.12 Deviations from the mean for the income data of Table 2.5 38
2.13 Outcomes from throwing the die twice 40
2.14 Sorted number of vehicles passing through the green phase 43
2.15 Deviations for vehicles passing through the green phase 44
2.16 Values of variance and standard deviation for values of p and q 47
2.17 Deviations for vehicles passing through the green phase raised to third
and fourth powers 57
2.18 Deviations from the mean for children’s ages 58
2.19 Data on household size, annual income, and number of vehicles for
forty households 59
2.20 Deviations needed for covariance and correlation estimates 61
3.1 Heights of 100 (fictitious) university students (cm) 76
3.2 Sample of the first and last five students 76
3.3 Sample of the first ten students 76
3.4 Intentional sample of ten students 77
3.5 Random sample of ten students (in order drawn) 77
3.6 Summary of results from Tables 3.2 to 3.5 77
6.1 Internet world usage statistics 112
6.2 Mixed-mode survey types (based on Dillman and Tarnai, 1991) 121
11.1 Selection grid by age and gender 222
13.1 Partial listing of households for a simple random sample 272
13.2 Excerpt of random numbers from the RAND Million Random Digits 273
13.3 Selection of sample of 100 members using four-digit groups from
Table 13.2 274
13.4 Data from twenty respondents in a fictitious survey 276
13.5 Sums of squares for population groups 286
13.6 Data for drawing an optimum household travel survey sample 299
13.7 Optimal allocation of the 2,000-household sample 299
13.8 Optimal allocation and expected sampling errors by stratum 300
13.9 Results of equal allocation for the household travel survey 300
13.10 Given information for economic design of the optimal allocation 301
13.11 Preliminary sample sizes and costs for economic design of the
optimum allocation 301
13.12 Estimation of the final sample size and budget 302
13.13 Comparison of optimal allocation, equal allocation, and economic
design for $150,000 survey 302
13.14 Comparison of sampling errors from the three sample designs 303
13.15 Desired stratum sample sizes and results of recruitment calls 305
13.16 Distribution of departments and students 310
13.17 Two-stage sample of students from the university 311
13.18 Multistage sample using disproportionate sampling at the first stage 313
13.19 Calculations for standard error from sample in Table 13.18 315
13.20 Examples of cluster samples 316
13.21 Cluster sample of doctor’s files 320
13.22 Random drawing of blocks of dwelling units 326
13.23 Calculations for paired selections and successive differences 332
18.1 Potential complex codes for income categories 406
18.2 Example codes for use of the internet and mobile phones 407
19.1 Results of an hypothetical household survey 424
19.2 Calculation of weights for the hypothetical household survey 424
19.3 Two-way distribution of completed surveys 424
19.4 Two-way distribution of terminated surveys 425
19.5 Table 19.3 expressed as percentages 425
19.6 Sum of the cells in Tables 19.3 and 19.4 425
19.7 Cells of Table 19.6 as percentages 426
19.8 Weights derived from Tables 19.7 and 19.5 426
19.9 Results of an hypothetical household survey compared to
secondary source data 427
19.10 Two-way distribution of completed surveys by percentage
(originally shown in Table 19.5) 427
19.11 Results of factoring the rows of Table 19.10 428
19.12 Second iteration, in which columns are factored 428
19.13 Third iteration, in which rows are factored again 429
19.14 Weights derived from the iterative proportional fitting 429
20.1 Final disposition codes for RDD telephone surveys 439
23.1 Preservation metadata elements and description 504
Acknowledgements
As is always the case, many people have assisted in the process that has led to this book.
First, I would like to acknowledge all those, too numerous to mention by name, who
have helped me over the years to learn and understand some of the basics of designing
and implementing surveys. They have been many and they have taught me much
of what I now know in this field. However, having said that, I would particularly like
to acknowledge those whom I have worked with over the past fifteen years or more on
the International Steering Committee for Travel Survey Conferences (ISCTSC), who
have contributed enormously to broadening and deepening my own understanding of
surveys. In particular, I would like to mention, in no particular order, Arnim Meyburg,
Martin Lee-Gosselin, Johanna Zmud, Gerd Sammer, Chester Wilmot, Werner Brög,
Juan de Dios Ortúzar, Manfred Wermuth, Kay Axhausen, Patrick Bonnel, Elaine
Murakami, Tony Richardson, (the late) Pat van der Reis, Peter Jones, Alan Pisarski,
Mary Lynn Tischer, Harry Timmermans, Marina Lombard, Cheryl Stecher, Jean-Loup
Madre, Jimmy Armoogum, and (the late) Ryuichi Kitamura. All these individuals have
inspired and helped me and contributed in various ways to this book, most of them,
probably, without realising that they have done so.
I would also like to acknowledge the support I have received in this endeavour from
the University of Sydney, and especially from the director of the Institute of Transport
and Logistics Studies, Professor David Hensher. Both David and the university have
provided a wide variety of support for the writing and production of this book, for
which I am most grateful.
However, most importantly, I would like to acknowledge the enormous support and
encouragement from my wife, Carmen, and her patience, as I have often spent long
hours working on this book, and her unquestioning faith in me that I could do it. She
has been an enduring source of strength and inspiration to me. Without her, I doubt that
this book would have been written.
As always, a book can see the light of day only through the encouragement and
support of a publisher and those assisting in the publishing process. I would like to
acknowledge Chris Harrison of Cambridge University Press, who first thought that
this book might be worth publishing and encouraged me to develop the outline for
it, and then provided critical input that has helped to shape the book into what it has
become. I would also like to thank profusely Mike Richardson, who carefully and
thoroughly copy-edited the manuscript, improving immensely its clarity and
completeness. I would also like to thank Joanna Breeze, the production editor at Cambridge.
She has worked with me with all the delays I have caused in the book production, and
has still got this book to publication in a very timely manner. However, as always, and
in spite of the help of these people, any errors that remain in the book are entirely my
responsibility.
Finally, I would like to acknowledge the contributions made by the many students I
have taught over the years in this area of survey design. The interactions we have had,
the feedback I have received, and the enjoyment I have had in being able to teach this
material and see students understand and appreciate what good survey design entails
have been most rewarding and have also contributed to the development of this book. I
hope that they and future students will find this book to be of help to them and a
continuing reference to some of those points that we have discussed.
Peter Stopher
Blackheath, New South Wales
August 2011
1 Introduction
1.1 The purpose of this book
There are a number of books available that treat various aspects of survey design,
sampling, survey implementation, and so forth (examples include Cochran, 1963; Dillman,
1978, 2000; Groves and Couper, 1998; Kish, 1965; Richardson, Ampt, and Meyburg,
1995; and Yates, 1965). However, there does not appear to be a single book that covers
all aspects of a survey, from the inception of the survey itself through to archiving the
data. This is the purpose of this book. The reader will find herein a complete treatment
of all aspects of a survey, including all the elements of design, the requirements for
testing and refinement, fielding the survey, coding and analysing the resulting data,
documenting what happened, and archiving the data, so that nothing is lost from what
is inevitably an expensive process.
This book concentrates on surveys of human populations, which are both more
challenging generally and more difficult both to design and to implement than most
surveys of non-human populations. In addition, because of the background of the author,
examples are drawn mainly from surveys in the area of transport planning. However,
the examples are purely illustrative; no background is needed in transport planning to
understand the examples, and the principles explained are applicable to any survey that
involves human response to a survey instrument. In spite of this focus on human
participation in the survey process, there are occasional references to other types of surveys,
especially observational and counting types of surveys.
In writing this book, the author has tried to make this as complete a treatment as
possible. Although extensive references are included to numerous publications and books
in various aspects of measuring data, the reader should be able to find all that he or she
requires within the covers of this book. This includes a chapter on some basic aspects
of statistics and probability that are used subsequently, particularly in the development
of the statistical aspects of surveys.
In summary, then, the purpose of this book is to provide the reader with an
extensive and, as far as possible, exhaustive treatment of issues involved in the design and
execution of surveys of human populations. It is the intent that, whether the reader is
a student, a professional who has been asked to design and implement a survey, or
someone attempting to gain a level of knowledge about the survey process, all
questions will be answered within these pages. This is undoubtedly a daunting task. The
reader will be able to judge the extent to which this has been achieved. The book is also
designed so that someone who has no prior knowledge of statistics, probability, surveys,
or the purposes to which surveys may be put can pick up and read this book, gaining
knowledge and expertise in doing so. At the same time, this book is designed as a
reference book. To that end, an extensive index is provided, so that the user of this book
who desires information on a particular topic can readily find that topic, either from the
table of contents, or through the index.
1.2 Scope of the book
As noted in the previous section, the book starts with a treatment of some basic statis-
tics and probability. The reader who is familiar with this material may find it appro-
priate to skip this chapter. However, for those who have already learnt material of this
type but not used it for a while, as well as those who are unfamiliar with the material,
it is recommended that this chapter be used as a means for review, refreshment, or
even first-time learning. It is then followed by a chapter that outlines some basic issues
of surveys, including a glossary of terms and definitions that will be found helpful
in reading the remainder of the book. A number of fundamental issues, pertinent to
overall survey design, are raised in this chapter. Chapter 4 introduces the ethics of
surveys, outlines a number of ethical issues, and proposes a number of
basic ethical standards to which surveys of human populations should adhere. The
fifth chapter of the book discusses the primary issues of designing a survey. A major
underlying theme of this chapter is that there is no such thing as an ‘all-purpose sur-
vey’. Experience has repeatedly demonstrated that only surveys designed with a clear
purpose in mind can be successful.
The next nine chapters deal with all the various design issues in a survey, given that
we have established the overall purpose or purposes of the survey. The first of these
chapters (Chapter 6) discusses and describes all the current methods that are available
for conducting surveys of human populations, in which people are asked to partic-
ipate in the survey process. Mention is also made of some methods of dealing with
other types of survey that are appropriate when the objects of the survey are observed
in some way and do not participate in the process. In Chapter 7, the topic of focus
groups is introduced, and potential uses of focus groups in designing quantitative and
qualitative surveys are discussed. The chapter does not provide an exhaustive treat-
ment of this topic, but does provide a significant amount of detail on how to organise
and design focus groups. In Chapter 8, the design of survey instruments is discussed
at some length. Illustrations of some principles of design are included, drawn princi-
pally from transport and related surveys. Chapters 9 and 10 deal with issues relating
to question design and question wording and special issues relating to qualitative and
preference surveys. Chapter 11 deals with the design of data collection procedures
themselves, including such issues as item and unit nonresponse, what constitutes a
complete response, the use of proxy reporting and its effects, and so forth. The seventh
of this group of chapters (Chapter 12) deals with pilot surveys and pretests, a topic
that is too often neglected in the design of surveys. A number of issues in designing
and undertaking such surveys and tests are discussed. Chapter 13 deals with the topic
of sample design and sampling issues. In this chapter, there is extensive treatment of
the statistics of sampling, including estimation of sampling errors and determination of
sample sizes. The chapter describes most of the available methods of sampling, includ-
ing simple random samples, stratified samples, multistage samples, cluster samples,
systematic samples, choice-based samples, and a number of sampling methods that are
often considered but that should be avoided in most instances, such as quota samples,
judgemental samples, and haphazard samples.
Chapter 14 addresses the topic of repetitive surveys. Many surveys are intended to
be done as a ‘one-off’ activity. For such surveys, the material covered in the preceding
chapters is adequate. However, there are many surveys that are intended to be repeated
from time to time. This chapter deals with such issues as repeated cross-sectional sur-
veys, panel surveys, overlapping samples, and continuous surveys. In particular, this
chapter provides the reader with a means to compare the advantages and disadvantages
of the different methods, and it also assists in determining which is appropriate to
apply in a given situation.
Chapter 15 builds on the material in the preceding chapters and deals with the issue
of survey economics. This is one of the most troublesome areas, because, as many
companies have found out, it is all too easy to be bankrupted by a survey that is under-
taken without a real understanding and accounting of the costs of a survey. While
information on actual costs will date very rapidly, this chapter attempts to provide rel-
ative data on costs, which should help the reader estimate the costs of different survey
strategies. This chapter also deals with many of the potential trade-offs in the design
of surveys.
Chapter 16 delves into some of the issues relating to the actual survey implemen-
tation process. This includes issues relating to training survey interviewers and moni-
toring the performance of interviewers, and the chapter discusses some of the danger
signs to look for during implementation. This chapter also deals with issues regarding
the ethics of survey implementation, especially the relationships between the survey
firm, the client for the survey, and the members of the public who are the respondents
to the survey. Chapter 17 introduces a topic that is becoming of increasing interest:
Web-based surveys. Although this is a field that is as yet quite young, there are an
increasing number of aspects that have been researched and from which the reader can
benefit. Chapter 18 deals with the process of coding and data entry. A major issue in
this topic is the geographic coding of places that may be requested in a survey.
Chapter 19 addresses the topics of data expansion and weighting. Data expansion is
outlined as a function of the sampling method, and statistical procedures for expanding
each of the different types of sample are provided in this chapter. Weighting relates to
problems of survey bias, resulting either from incomplete coverage of the population in
the sampling process or from nonresponse by some members of the subject population.
This is an increasingly problematic area for surveys of human populations, resulting
from a myriad of issues relating to voluntary participation. Chapter 20 addresses the
issue of nonresponse more completely. Here, issues of who is likely to respond and
who is not are discussed. Methods to increase response rates are described, and refer-
ence is made again to the economics of the survey design. The question of computing
response rates is also addressed in this chapter. This is usually the most widely recog-
nised statistic for assessing the quality of a survey, but it is also a statistic that is open
to numerous methods of computation, and there is considerable doubt as to just what
it really means.
Chapter 21 deals with a range of other measures of data quality, some that are gen-
eral and some, by way of example, that are specific to surveys in transport. These mea-
sures are provided as a way to illustrate how survey-specific measures of quality can
be devised, depending on the purposes of the survey. Chapter 22 discusses some issues
of the future of human population surveys, especially in the light of emerging technol-
ogies and their potential application and misapplication to the survey task.
Chapter 23, the final chapter in the book, covers the issues of documenting and
archiving the data. This all too often neglected area of measuring data is discussed at
some length. A list of headings for the final report on the survey is provided, along with
suggestions as to what should be included under the headings. The issue of archiving
data is also addressed at some length. Data are expensive to collect and are rarely
archived appropriately. The result is that many expensive surveys are effectively lost
soon after the initial analyses are undertaken. In addition, knowledge about the survey
is often lost when those who were most centrally involved in the survey move on to
other assignments, or leave to work elsewhere.
1.3 Survey statistics
Statistics in general, and survey statistics in particular, constitute a relatively young
area of theory and practice. The earliest instance of the use of statistics is probably in
the middle of the sixteenth century, and related to the start of data collection in France
regarding births, marriages, and deaths, and in England to the collection of data on
deaths in London each week (Berntson et al., 2005). It was then not until the middle
of the eighteenth century that publications began to appear advancing some of the ear-
liest theories in statistics and probability. However, much of the modern development
of statistics did not take place until the late nineteenth and early twentieth centuries
(Berntson et al., 2005):
Beginning around 1880, three famous mathematicians, Karl Pearson, Francis Galton and Edgeworth,
created a statistical revolution in Europe. Of the three mathematicians, it was Karl Pearson, along
with his ambition and determination, that led people to consider him the founder of the twentieth-
century science of statistics.
It was only in the early twentieth century that most of the now famous names in sta-
tistics made their contributions to the field. These included such statisticians as Karl
Pearson, Francis Galton, C. R. Rao, R. A. Fisher, E. S. Pearson, and Jerzy Neyman,
among many others, who all made major contributions to what we know today as the
science of statistics and probability.
Survey sampling statistics is of even more recent vintage. Among the most notable
names in this field of study are those of R. A. Fisher, Frank Yates, Leslie Kish, and
W. G. Cochran. Fisher may have given survey sampling its birth, both through his own
contributions and through his appointment of Frank Yates as assistant statistician at
Rothamsted Experimental Station in 1931. In this post, Yates developed, often in col-
laboration with Fisher, what may be regarded as the beginnings of survey sampling in
the form of experimental designs (O’Connor and Robertson, 1997). His book Sampling
Methods for Censuses and Surveys was first published in 1949, and it appears to be the
first book on statistical sampling designs.
Leslie Kish, who helped found the Institute for Social Research at the University of
Michigan, is also regarded as one of the founding fathers of modern survey sampling
methods, and he published his seminal work, called Survey Sampling, in 1965. Close
in time to Kish, W. G. Cochran published his seminal work, Sampling Techniques, in
1963.
Based on these efforts, the science of survey sampling cannot be considered to be
much over fifty years old, a very new scientific endeavour. As a result of this relative
recency, there is still much to be done in developing the topic of survey sampling,
while technologies for undertaking surveys have undergone and continue to undergo
rapid evolution. The fact that most of the fundamental books on the topic are about
forty years old suggests that it is time to undertake an updated treatise on the topic.
Hence, this book has been undertaken.
2 Basic statistics and probability
2.1 Some definitions in statistics
Statistics is defined by the Oxford Dictionary of English Etymology as ‘the political
science concerned with the facts of a state or community’, and the word is derived
from the German statistisch. The beginning of modern statistics was in the sixteenth
century, when large amounts of data began to be collected on the populations of coun-
tries in Europe, and the task was to make sense of these vast amounts of data. As statis-
tics has evolved from this beginning, it has become a science concerned with handling
large quantities of data, but also with using much smaller amounts of data in an effort
to represent entire populations, when the task of handling data on the entire population
is too large or expensive. The science of statistics is concerned with providing inputs
to political decision making, to the testing of hypotheses (understanding what would
happen if …), drawing inferences from limited data, and, considering the data limita-
tions, doing all these things under conditions of uncertainty.
A word used commonly in statistics and surveys is population. The population is
defined as the entire collection of elements of concern in a given situation. It is also
sometimes referred to as a universe. Thus, if the elements of concern are pre-school
children in a state, then the population is all the pre-school children in the state at the
time of the study. If the elements of concern are elephants in Africa, then the popula-
tion consists of all the elephants currently in Africa. If the elements of concern are the
vehicles using a particular freeway on a specified day, then the population is all the
vehicles that use that particular freeway on that specific day.
It is very clear that statistics is the study of data. Therefore, it is necessary to
understand what is meant by data. The word data is a plural noun from the Latin
datum, meaning given facts. As used in English, the word means given facts from
which other facts may be inferred. Data are fundamental to the analysis and model-
ling of real-world phenomena, such as human populations, the behaviour of firms,
weather systems, astronomical processes, sociological processes, genetics, etc.
Therefore, one may state that statistics is the process for handling and analysing
data, such that useful conclusions can be drawn, decisions made, and new knowledge
accumulated.
Another word used in connection with statistics is observation. An observation may
be defined as the information that can be seen about a member of a subject population.
An observation comprises data about relevant characteristics of the member of the
population. This population may be people, households, galaxies, private firms, etc.
Another way of thinking of this is that an observation represents an appropriate group-
ing of data, in which each observation consists of a set of data items describing one
member of the population.
A parameter is a quantity that describes some property of the population. Parameters
may be given as numbers, proportions, or percentages. For example, the number of
male pre-school children in the state might be 16,897, and this number is a parameter.
The proportion of baby elephants in Africa might be 0.39, indicating that 39 per cent
of all elephants in Africa at this time are babies. This is also a parameter. Sometimes,
one can define a particular parameter as being critical to a decision. This would then
be called a decision parameter. For example, suppose that a decision is to be made
as to whether or not to close a primary school. The decision parameter might be the
number of schoolchildren that would be expected to attend that school in, say, the next
five years.
A sample is some subset of a population. It may be a large proportion of the popu-
lation, or a very small proportion of the population. For example, a survey of Sydney
households, which comprise a population of about 1,300,000, might consist of 130,000
households (a 10 per cent sample) or 300 households (a 0.023 per cent sample).
A statistic is a numerical quantity that describes a sample. It is therefore the equiva-
lent of a parameter, but for a sample rather than the population. For example, a survey
of 130,000 households in Sydney might have shown that 52 per cent of households
own their own home or are buying it. This would be a statistic. If, on the other hand, a
figure of 54 per cent was determined from a census of the 1,300,000 households, then
this figure would be a parameter.
Statistical inference is the process of making statements about a population based
on limited evidence from a sample study. Thus, if a sample of 130,000 households
in Sydney was drawn, and it was determined that 52 per cent of these owned or were
purchasing their homes, then statistical inference would lead one to propose that this
might mean that 676,000 (52 per cent of 1,300,000) households in Sydney own or are
purchasing their homes.
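The arithmetic behind the Sydney example can be sketched in a few lines of Python. This is only an illustration: the function names are assumptions introduced here, and the figures are the ones quoted in the text.

```python
def sampling_fraction(sample_size: int, population_size: int) -> float:
    """Return the sample size as a percentage of the population size."""
    return 100.0 * sample_size / population_size


def expand_proportion(sample_proportion: float, population_size: int) -> int:
    """Apply a sample proportion (a statistic) to the population size to
    propose an estimate of the corresponding population count."""
    return round(sample_proportion * population_size)


POPULATION = 1_300_000  # Sydney households, as quoted in the text

large_pct = sampling_fraction(130_000, POPULATION)  # the 10 per cent sample
small_pct = sampling_fraction(300, POPULATION)      # the 0.023 per cent sample
owners = expand_proportion(0.52, POPULATION)        # inferred owning or buying

print(f"{large_pct:.1f}% sample, {small_pct:.3f}% sample, {owners:,} households")
```

The 52 per cent figure is a statistic because it comes from the sample; the 676,000 households it implies is an inference about the population, not a measured parameter.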
2.1.1 Censuses and surveys
Of particular relevance to this book is the fact that there are two methods for collect-
ing data about a population of interest. The first of these is a census, which involves
making observations of every member of the population. Censuses of the human pop-
ulation have been undertaken in most countries of the world for many years. There are
references in the Bible to censuses taken among the early Hebrews, and later by the
Romans at the time of the birth of Christ. In Europe, most censuses began in the
eighteenth century, although a few began earlier than that. In the United States of America,
the first census was taken in 1790. Many countries undertake a census once
in each decade, either in the year ending in zero or in one. Some countries, such as
Australia, undertake a census twice in each decade. A census may be as simple as a
head count (enumerating the total size of the population) or it may be more complex,
by collecting data on a number of characteristics of each member of the population,
such as name, address, age, country of birth, etc.
A survey is similar to a census, except that it is conducted on a subset of the population,
not the entire population. A survey may involve a large percentage of the population
or may be restricted to a very small sample of the population. Much of the science of
survey statistics has to do with how one makes a small sample represent the entire popu-
lation. This is discussed in much more detail in the next chapter. A survey, by definition,
always involves a sample of the population. Therefore, to speak of a 100 per cent sample
is contradictory; if it is a sample, it must be less than 100 per cent of the population.
2.2 Describing data
One of the first challenges for statistics is to describe data. Obviously, one can provide
a complete set of data to a decision maker. However, the human mind is not capable
of utilising such information efficiently and effectively. For example, a census of the
United States would produce observations on over 300 million people, while one of
India would produce observations of over 1 billion people. A listing of those observa-
tions represents something that most human beings would be incapable of utilising.
What is required, then, is to find some ways to simplify and describe data, so that use-
ful information is preserved but the sheer magnitude of the underlying data is hidden,
thereby not distracting the human analyst or decision maker.
Before examining ways in which data might be presented or described, such that the
mind can grasp the essential information contained therein, it is important to under-
stand the nature of different types of data that can be collected. To do this, it seems
useful to consider the measurement of a human population, especially since that is the
main topic of the balance of this book.
In mathematical statistics, we refer to things called variables. A variable is a char-
acteristic of the population that may take on differing or varying values for different
members of the population. Thus, variables that could be used to describe members
of a human population may include such characteristics as name, address, age or date
of birth, place of birth, height, weight, eye colour, hair colour, and shoe size. Each of
these characteristics provides differing levels of information that can be used in vari-
ous ways. We can divide these characteristics into four different types of scales, a scale
representing a way of measuring the characteristic.
2.2.1 Types of scales
Nominal scales
Each person in the population has a name. The person’s name represents a label by
which that person can be identified, but provides little other information. Names can
be ordered alphabetically or can be ordered in any of a number of arbitrary ways, such
as the order in which data are collected on individuals. However, no information is
provided by changing the order of the names. Therefore, the only thing that the name
provides is a label for each member of the population. This is called a nominal scale.
A nominal scale is the least informative of the different types of scales that can be used
to measure characteristics, but its lack of other information does not render it of less
value. Other examples of nominal data are the colours of hair or eyes of the members
of the population, bus route numbers, the numbers assigned to census collection dis-
tricts, names of firms listed on a country’s stock exchange, and the names of magazines
stocked by a newsagency.
Ordinal scales
Each person in the population has an address. The address will usually include a house
number and a street name, along with the name of the town or suburb in which the
house is located. The address clearly also represents a label, just as does the person’s
name. However, in the case of the address, there is more information provided. If the
addresses are sorted by number and by street, in most places in the world this will pro-
vide additional information. These sorted addresses will actually help an investigator
to locate each home, in that it is expected that the houses are arranged in numerical
order along the street, and probably with odd numbers on one side of the street and
even numbers on the other side. As a result, there is order information provided in the
address. It is, therefore, known as an ordinal scale. However, if it is known that one
person lives at 27 Main Street, and another person lives at 35 Main Street, this does not
indicate how far apart these two people live. In some countries, they could be next door
to each other, while in others there might be three houses between them or even seven
houses between them (if numbering goes down one side of the street and back on the
other). The only thing that would be known is that, starting at the first house on Main
Street, one would arrive at 27 before one would arrive at 35. Therefore, order is the
only additional information provided by this scale. Other examples of ordinal scales
would be the list of months in the year, censor ratings of movies, and a list of runners
in the order in which they finished a race.
Interval scales
Each person in the population has a shoe size. For the purposes of this illustration,
the fact that there are slight inconsistencies in shoe sizes between manufacturers will
be ignored, and it will be assumed, instead, that a man’s shoe size nine is the same
for all men’s shoes, for example. Shoe size is certainly a label, in that a shoe can be
called a size nine or a size twelve, and so forth. This may be a useful way of labelling
shoes for a lot of different reasons. In addition, there is clearly order information, in
that a size nine is smaller than a size twelve, and a size seven is larger than a size five.
Furthermore, within each of children’s, men’s, and women’s shoes, each increase in
a size represents a constant increase in the length of the shoe. Thus, the difference
between a size nine and a size ten shoe for a man is the same as the difference between
a size eight and a size nine, and so on for any two adjacent numbers. In other words,
there is a constant interval between each shoe size. On the other hand, there is no nat-
ural zero in this scale (in fact, a size of zero generally does not exist), and it is not true
that a size five is half the length of a size ten. Therefore, shoe size may be considered
to be an interval scale. Women’s dress sizes in a number of countries also represent
an interval scale, in which each increment in dress size represents a constant interval
of increase in size of the dress, but a size sixteen dress is not twice as large as a size
eight. In many cases, the sizing of an item of clothing as small, medium, large, etc. also
represents an interval scale. Another example of an interval scale is the normal scale of
temperature in either degrees Celsius or degrees Fahrenheit. An interval of one degree
represents the same increase or decrease in temperature, whether it is between 40 and
41 or 90 and 91. However, we are not able to state that 60 degrees is twice as hot as
30 degrees. There is also not a natural zero on either the Celsius scale or the Fahrenheit
scale. Indeed, the Celsius scale sets the temperature at which water freezes as 0, but
the Fahrenheit scale sets this at 32, and there is not a particular physical property of the
zero on the Fahrenheit scale.
Ratio scales
Each member of the population has a height and a weight. Again, each of these two
measures could be used as a label. We might say that a person is 180 centimetres tall,
or weighs 85 kilograms. These measures also contain ordinal information. We know
that a person who weighs 85 kilograms is heavier than a person who weighs 67 kilo-
grams. Furthermore, we know that these measures contain interval information. The
difference between 179 centimetres and 180 centimetres is the same as the difference
between 164 centimetres and 165 centimetres. However, there is even more information
in these measures. There is ratio information. In other words, we know that a person
who is 180 centimetres tall is twice as tall as a person who is 90 centimetres tall, and
that a person weighing 45 kilograms is only half the weight of a person weighing 90
kilograms. There are two important new pieces of information provided by these mea-
sures. First, there is a natural zero in the measurement scale. Both weight and height
have a zero point, which represents the absence of weight or the absence of height.
Second, there is a multiplicative relationship among the measures on the scale, not just
an additive one. Therefore, both weight and height are described as ratio scales. Other
examples of ratio scales are distance or length measures, measures of speed, measures
of elapsed time, and so forth. However, it should be noted that measurement of clock
time is interval-scaled (there is no natural zero, and 5 a.m. is not a half of 10 a.m.),
while elapsed time is ratio-scaled, because zero minutes represents the absence of any
elapsed time, and twenty minutes is twice as long as ten minutes, for example.
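The difference between interval and ratio scales can be checked numerically. The following is a minimal sketch, assuming the standard Celsius-to-Fahrenheit and centimetre-to-inch conversions; the function names are illustrative only. A ratio that is meaningful should not change when the unit of measurement changes.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def cm_to_inches(centimetres: float) -> float:
    """Convert a length from centimetres to inches."""
    return centimetres / 2.54


# Interval scale: the apparent ratio depends on the unit, so it carries no meaning.
ratio_c = 60.0 / 30.0
ratio_f = celsius_to_fahrenheit(60.0) / celsius_to_fahrenheit(30.0)

# Ratio scale: the ratio is the same whatever unit is used.
ratio_cm = 180.0 / 90.0
ratio_in = cm_to_inches(180.0) / cm_to_inches(90.0)

# On an interval scale, equal intervals remain equal anywhere on the scale.
step_low = celsius_to_fahrenheit(41.0) - celsius_to_fahrenheit(40.0)
step_high = celsius_to_fahrenheit(91.0) - celsius_to_fahrenheit(90.0)

print(f"temperature ratios: {ratio_c:.2f} (Celsius) vs {ratio_f:.2f} (Fahrenheit)")
print(f"height ratios: {ratio_cm:.2f} (cm) vs {ratio_in:.2f} (inches)")
print(f"one-degree Celsius steps in Fahrenheit: {step_low:.2f} and {step_high:.2f}")
```

The temperature ratio changes with the unit because neither scale has a natural zero, whereas the height ratio survives the change of unit.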
Measurement scales
The preceding sections have outlined four scales of measurement: nominal, ordinal,
interval, and ratio. They have also demonstrated that these four scales are themselves
an ordinal scale, in which the order, as presented in the preceding sentence, indicates
increasing information content. Furthermore, each of the scale types, as ordered above,
contains the information of the previous type of scale, and then adds new information
content. Thus an ordinal scale also has nominal information, but adds to that informa-
tion on order; an interval scale has both nominal and ordinal information, but adds to
that a consistent interval of measurement; and a ratio scale contains all nominal, ordi-
nal, and interval information, but adds ratio relationships to them.
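The cumulative nature of the four scales can be expressed as a small data structure. This is a sketch under stated assumptions: the class, dictionary, and function names are hypothetical, and the numeric values assigned to the scales serve only to encode their order.

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four scales, numbered so that a higher value means more information."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# What each scale adds to the one before it, following the text.
ADDED_INFORMATION = {
    Scale.NOMINAL: "label",
    Scale.ORDINAL: "order",
    Scale.INTERVAL: "constant interval",
    Scale.RATIO: "ratio relationships",
}


def information_content(scale: Scale) -> list[str]:
    """List everything a scale conveys, including what it inherits."""
    return [ADDED_INFORMATION[s] for s in Scale if s <= scale]


# Example variables taken from the preceding sections.
EXAMPLES = {
    "name": Scale.NOMINAL,
    "street address": Scale.ORDINAL,
    "shoe size": Scale.INTERVAL,
    "height": Scale.RATIO,
}

for variable, scale in EXAMPLES.items():
    print(f"{variable}: {', '.join(information_content(scale))}")
```

Because the scale types are themselves ordered, comparisons such as `Scale.ORDINAL < Scale.RATIO` are meaningful, which mirrors the observation that the four scales form an ordinal scale.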
There are two other ways in which scales can be described, because most scales can
be measured in different ways. The first of these relates to whether the scale is contin-
uous or discrete. A continuous scale is one in which the measurement can be made to
any degree of precision desired. For example, we can measure elapsed time to the near-
est hour, or minute, or second, or nanosecond, etc. Indeed, the only thing that limits
the precision by which we can measure this scale is the precision of our instruments
for measurement. However, there is no natural limit to precision in such cases. This is
a continuous scale. A discrete scale, on the other hand, cannot be subdivided beyond
a certain point. For example, shoe sizes are a discrete scale. Many shoe manufactur-
ers will provide shoes in half-size increments, while others will provide them only in
whole-size increments. Subdivision below half sizes simply is not done. Similarly, any
measurement that involves counting objects, such as counting the number of members
of a population, is a discrete scale. We cannot have fractional people, fractional houses,
or fractional cars, for example.
The second descriptor of a scale is whether it is inherently exact or approximate. By
their nature, all continuous scales are approximate. This is so because we can always
increase the precision of measurement. Generally, numbers obtained from counting
are exact, unless the counting mechanism is capable of error. However, other discrete
scales may be approximate or exact. In most clothing or shoe sizes, the measure would
be considered approximate, because sizes often differ between manufacturers, and
between countries. A size nine shoe is not the same size in the United States and in the
United Kingdom, for example, nor is it necessarily the same size from two different
shoe makers in the same country.
It is important to recognise what type of a scale we are dealing with, when infor-
mation is measured on scales, because the type of scale will also often either dictate
how the information can be presented or restrict the analyst to certain ways of pre-
sentation. Similarly, whether the measure is discrete or continuous will also affect the
presentation of the data, as will, in some cases, whether the data are approximate or
exact.
2.2.2 Data presentation: graphics
It is appropriate to start with some simple rules about graphical presentations. There
are four principal types of graphical presentation: scatter plots, pie charts, histograms
or bar charts, and line graphs.
A scatter plot is a plot of the frequency with which specific values of a pair of vari-
ables occur in the data. Thus, the X-axis of the plot will contain the values of one of the
variables that are found in the data, and the Y-axis will contain the values of the other
variable. As such, any type of measure can be presented on a scatter plot. However, if
all values occur only once (i.e., are unique to an observation), then a scatter plot is
of no particular interest. Therefore, although any data can theoretically be plotted on a
scatter plot, data that represent unique values, or continuous data that will
probably have frequencies of only one or two at most for any pair of values, will not
be illuminated by a scatter plot.
An example of a scatter plot is provided in Figure 2.1, which shows a scatter plot of
odometer readings of cars versus the model year of the vehicle. The Y-axis is a ratio-
scaled variable, and the X-axis is an interval-scaled variable. The scatter plot indicates
that there probably is a relationship between odometer readings and model year, such
that the higher the model year value, the lower the odometer reading, as would be
expected. This is a useful scatter plot.
Figure 2.2 illustrates a scatter plot of two nominal-scaled variables: fuel type versus
body type. It is not a very useful illustration of the data. First, we cannot tell how many
points fall at each combination of values. Second, all it really tells us is that there are
no taxis (body type 5) in this data set, that all vehicle types use petrol (fuel type 1), that
all except motorcycles (body type 6) use diesel (fuel type 2), and that only cars (body
type 1), four-wheel drive (4WD) vehicles (body type 2), and utility/van/panel vans
(body type 3) use dual fuel (fuel type 4). This illustrates that nominal data (both fuel
type and body type are nominal scales) may not produce a useful scatter plot.
Figure 2.1 Scatter plot of odometer reading versus model year (X-axis: model year, 1950 to 2010; Y-axis: odometer reading, 0 to 1,200,000)
Figure 2.2 Scatter plot of fuel type by body type (X-axis: body type, 0 to 8; Y-axis: fuel type, 0 to 5)
A pie chart is a circle that is divided into segments representing specific values in
the data, with the length of the segment along the circumference of the circle indicating
how frequently the value occurs in the data. Again, pie charts can be used with any type
of data, when the information to be presented is the frequency of occurrence. However,
they will generally not work with continuous data, unless the data are first grouped and
converted to discrete categories. An example of a pie chart is provided in Figure 2.3.
This shows that the pie chart works well for nominal data, in this case the vehicle body
type from a survey of households.
Figure 2.4 shows a pie chart for category data – i.e., discrete data. The data are
reported household incomes from a survey of households. The categories were those
used in the survey. Income, being measured in dollars and with a natural zero, is actu-
ally a ratio scale. In the categories collected, income is a ratio-scaled discrete measure.
Again, the pie chart provides a good representation of the data.
A histogram or bar chart is used for presenting discrete data. Such data will be
interval- or ratio-scaled data. Histograms can be constructed in several different ways.
When presenting complex information, bars can be stacked, showing how different
classes of items add up to a total within each bar. Bars can also be plotted so that each
bar touches the next, or they may be plotted with gaps between. There is no particular
rule for plotting bars in this manner, and it is more a matter of personal preference.
Examples of two types of histograms are shown in Figures 2.5 and 2.6. Histograms can
also be used to indicate the frequency of occurrence of specific values of both nominal
and ordinal data. In this case, it is preferred that the bars do not touch, the spaces
indicating that the scale is not interval or ratio.
[Figure 2.3 Pie chart of vehicle body types. Segments: 4WD, car, motorcycle, other, taxi, truck, utility vehicle.]
[Figure 2.4 Pie chart of household income groups. Segments: the eleven income classes from None to $104,000+, plus Don't know and Refused.]
Figure 2.5 shows ratio-scaled discrete data on household incomes, this time in a
two-dimensional histogram or bar chart. Note that the bars touch, indicating the under-
lying continuous nature of the data. Figure 2.6 shows a histogram of nominal data
frequencies of vehicle type for household vehicles. Two instructive observations may
be made of this histogram. First, the dominance of the car tends to make the histogram
somewhat less useful. In contrast, the pie chart really communicated the information
better. Second, the bars do not touch, in this case clearly indicating the discrete
categories of a nominal scale.
[Figure 2.5 Histogram of household income. X-axis: annual income; Y-axis: number of respondents.]
[Figure 2.6 Histogram of vehicle types. X-axis: vehicle type; Y-axis: number.]
The fourth type of chart is a line graph. This is much more restricted in application
than the other types of charts. A line graph should be used only with continuous data,
whether interval- or ratio-scaled. It is inappropriate to use line graphs to present data
that are discrete, or data that are nominal or ordinal in nature. An example of a line
graph is shown in Figure 2.7.
Temperature is inherently a continuous measurement. It is therefore appropriate to
use a line graph to present these data. This case demonstrates the use of two lines on
the same graph. This allows one not only to see the maximum and minimum tempera-
tures, but also to deduce that there may be a relationship between the two.
A special type of line graph is an ogive. An ogive is a cumulative frequency line.
Even when the original data are discrete in nature, the ogive can be plotted as a line,
although a cumulative histogram is preferable. Generally, it makes sense to create
cumulative graphs only of interval- or ratio-scaled data, although the data may be
either discrete or continuous. Figure 2.8 shows an ogive for the income data used in
Figure 2.5.
The ogive is essentially an S-shaped curve, in that it starts with a line that is along
the X-axis and ends with a line that is parallel to the X-axis, with the line climbing
more or less continuously from the X-axis at the left to the top of the graph at the right.
A special case of the ogive is a relative ogive, in which the proportions or percentage
of observations are used, not the absolute counts, as in Figure 2.8. A relative ogive for
the same data will have the same shape, but the scale of the Y-axis changes, as shown
in Figure 2.9.
[Figure 2.7 Line graph of maximum and minimum temperatures for thirty days. X-axis: day of week; Y-axis: temperature (°C); lines: maximum temperature, minimum temperature.]
A step chart, which is the discrete version of an ogive, could also be drawn for
the income data. It can use either the count, the proportion, or the percentage for the
Y-axis. A step chart is shown in Figure 2.10.
2.2.3 Data presentation: non-graphical
Graphical presentations of data are very useful. As can be seen in the preceding sec-
tion, the adage that ‘a picture is worth a thousand words’ is clearly interpretable as
‘a picture is worth a thousand numbers’. Indeed, one can grasp rather readily from
the graphs what is potentially a large amount of data, which the human mind would
have difficulty grasping as raw data. However, pictures are not the only ways in
which data can be presented for easier assimilation. There are also numeric ways to
describe data. Ideally, what one would like would be some summary variables that
would give one an idea about the magnitude of each variable in the data, the dispersion
of values, the variability of the values, and the symmetry or lack of symmetry
in the data.
[Figure 2.8 Ogive of cumulative household income data from Figure 2.5. X-axis: household income; Y-axis: cumulative number.]
[Figure 2.9 Relative ogive of household income. X-axis: household income; Y-axis: cumulative proportion.]
Measures of magnitude
These measures could include such concepts as frequencies of occurrence of particular
values in the data, proportions of the data that possess a particular value, cumulative
frequencies or proportions, and some form of average value. Each of these measures
is considered separately.
Frequencies and proportions
Frequencies are simply counts of the number of times that a particular value occurs in
the data, while proportions are frequencies divided by the total number of observations
in the data. Table 2.1 shows the frequencies of occurrence of the different vehicle types
used in the earlier illustrations of graphical presentations.
For nominal data, cumulative frequencies or proportions are not sensible, because
the scale does not contain any ordered information. Thus, to produce a cumulative
frequency distribution for the entries in Table 2.1 would not make sense. Moreover, it
should be noted that frequencies and proportions are generally sensible only for dis-
crete data. However, if, in continuous data, there are large numbers of observations
with the same value and the data set is large, then frequencies and proportions may
possibly be useful. This, for example, might be the case for national data derived from
a census.
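The calculation behind Table 2.1 can be sketched in a few lines of Python. This is only an illustration: the counts are copied from Table 2.1, and the variable names are chosen here for convenience.

```python
# Vehicle-type counts copied from Table 2.1.
vehicle_counts = {
    "Car": 1191,
    "4WD": 96,
    "Utility vehicle": 134,
    "Truck": 10,
    "Taxi": 0,
    "Motorcycle": 19,
    "Other": 7,
}

# Total number of observations.
total = sum(vehicle_counts.values())

# A proportion is a frequency divided by the total number of observations.
proportions = {name: count / total for name, count in vehicle_counts.items()}

for name, count in vehicle_counts.items():
    print(f"{name:<16}{count:>6}{proportions[name]:>8.3f}")
print(f"{'Total':<16}{total:>6}{sum(proportions.values()):>8.3f}")
```

No cumulative column is computed, in keeping with the point that cumulative values are not sensible for a nominal scale.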
[Figure 2.10 Relative step chart of household income. X-axis: household income; Y-axis: cumulative proportion.]
For the income data plotted in Figure 2.5, both frequencies or proportions, and cumu-
lative frequencies or proportions, make sense. These are shown in Table 2.2. From the
information in Table 2.2 it is possible to grasp several things about the data on household
income, such as the fact that the largest group is the one with $15,600–$25,999 annual
income, followed by $78,000–$103,999 and $36,400–$51,999. It can also be seen that
16 per cent of households would not report their income. When the non-reported income
is excluded, one can see that the proportions change substantially, and that just over a
half of the population have incomes below $52,000. In effect, this table has summarised
over 1,000 pieces of data and made them comprehensible, by presenting just a handful of numbers.
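As a rough sketch of how the cumulative columns of Table 2.2 are built, the following Python fragment accumulates the frequencies and divides by the total with and without the missing responses. The frequencies are copied from Table 2.2; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Frequencies copied from Table 2.2 (valid responses only).
valid = [
    ("None", 28),
    ("$1–$4,159", 2),
    ("$4,160–$8,319", 11),
    ("$8,320–$15,599", 67),
    ("$15,600–$25,999", 155),
    ("$26,000–$36,399", 97),
    ("$36,400–$51,999", 129),
    ("$52,000–$62,399", 72),
    ("$62,400–$77,999", 105),
    ("$78,000–$103,999", 133),
    ("$104,000+", 101),
]
# Missing responses, also from Table 2.2.
missing = [("Don't know", 1), ("Refused", 169)]

n_valid = sum(count for _, count in valid)
n_all = n_valid + sum(count for _, count in missing)

# Running totals of the valid frequencies.
cumulative = list(accumulate(count for _, count in valid))
cum_incl = [c / n_all for c in cumulative]    # missing data included in the base
cum_excl = [c / n_valid for c in cumulative]  # missing data excluded from the base

# First class at which the cumulative proportion (missing excluded) reaches 0.5.
median_class = next(label for (label, _), p in zip(valid, cum_excl) if p >= 0.5)

for (label, count), c, pi, pe in zip(valid, cumulative, cum_incl, cum_excl):
    print(f"{label:<18}{count:>5}{c:>6}{pi:>8.4f}{pe:>8.4f}")
print("Class containing the halfway point:", median_class)
```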
In the case of the income data shown in Table 2.2, the groups were defined in the sur-
vey itself. However, one may also take data that are collected as continuous measures
and group them, both to display as a histogram and to present them in a table, similar
to Table 2.2. In such a case, it is necessary to know into how many categories to group
the data.
Number of classes or categories  Sturges’ rule (Sturges, 1926) provides
guidance on how to determine the maximum number of classes into which to divide
data, whether grouping already discrete data or continuous data. There are a number
of elements to the rule.
(1) Interval classes must be inclusive and non-overlapping.
(2) Intervals should usually be of equal width, although the first and last interval may
be open-ended for some types of data.
(3) The number of classes depends on the number of observations in the data, accord-
ing to equation (2.1):
\[ k = 1 + 3.322 \times \log_{10} n \tag{2.1} \]
where k = the number of classes, n = the number of observations.
Table 2.1 Frequencies and proportions of vehicle types
Vehicle type      Frequency  Proportion
Car                   1,191       0.817
4WD                      96       0.066
Utility vehicle         134       0.092
Truck                    10       0.007
Taxi                      0       0
Motorcycle               19       0.013
Other                     7       0.005
Total                 1,457       1.000
Suppose, for example, that the income data had been collected as actual annual
income, and not in income classes. One might then ask the question as to how many
classes would be the maximum that could be used for income. This would be obtained
by substituting 900 into the above equation, because one should not include the miss-
ing data. This would result in a value for k of 10.81, which would be truncated to 10.
Therefore, Sturges’ rule would indicate that the maximum number of intervals that
should be used for these data is ten. The data were actually collected in eleven classes.
Therefore, this would suggest that the design was marginally appropriate and there
should not be a need to group together any of the classes with the number of valid
observations obtained. However, the intervals used violate Sturges’ rule in one respect,
in that they are not of equal size. This is not uncommon with income grouping, where
it is often the case, as here, that the lower incomes are divided into smaller classes than
the higher incomes. This is generally done to keep the population of the classes more
nearly equal.
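The substitution into equation (2.1) can be checked with a short Python sketch. The function name `sturges_classes` is an assumption made here for illustration; truncation follows the treatment in the text.

```python
import math


def sturges_classes(n: int) -> float:
    """Equation (2.1): k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


# 900 valid income observations, as in the worked example.
k = sturges_classes(900)
max_classes = math.floor(k)  # truncate to a whole number of classes

print(f"k = {k:.2f}, maximum number of classes = {max_classes}")
```

For 900 observations this reproduces the value of 10.81 quoted above, truncated to ten classes.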
Table 2.2 Frequencies, proportions, and cumulative values for household income
                                                         Cumulative proportion
Income range        Frequency  Proportion  Cumulative    Including   Excluding
                                           frequency     missing     missing
None                       28      0.0262          28      0.0262      0.0311
$1–$4,159                   2      0.0019          30      0.0280      0.0333
$4,160–$8,319              11      0.0103          41      0.0383      0.0456
$8,320–$15,599             67      0.0626         108      0.1009      0.1200
$15,600–$25,999           155      0.1449         263      0.2458      0.2922
$26,000–$36,399            97      0.0907         360      0.3364      0.4000
$36,400–$51,999           129      0.1206         489      0.4570      0.5433
$52,000–$62,399            72      0.0673         561      0.5243      0.6233
$62,400–$77,999           105      0.0981         666      0.6224      0.7400
$78,000–$103,999          133      0.1243         799      0.7467      0.8878
$104,000+                 101      0.0944         900      0.8411      1.0000
Don’t know                  1      0.0009         901      0.8421
Refused                   169      0.1579       1,070      1.0000
Total                   1,070      1
Suppose that the temperature data used in Figure 2.7 were to be grouped into classes.
The raw data are shown in Table 2.3. There are thirty observations of daily maximum
and minimum temperatures in this data set. Applying Sturges’ rule, the value of
k is found to equal 5.91, suggesting that five intervals would be the most that could be
used. For the high temperatures, the range is twenty-two to thirty-three. If this range
is divided into groupings of two degrees, this would produce six intervals, while using
three degrees would produce four intervals. In this case, given that k was found to be
close to six, it would be best to use six intervals of two degrees per interval. For the low
temperatures, the range is from sixteen to twenty-three. Grouping these also into groups
of two degrees in size, which is preferable when one wants to look at both minimum
and maximum temperatures on the same graph, or in side-by-side graphs, would result
in four groups. Because this is less than the maximum of six, it is acceptable. In this
case, grouping is sensible only if what one wants to do is to create a histogram of the
frequency with which various maximum and minimum temperatures occur. Such a
frequency table is shown in Table 2.4.
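The grouping into two-degree classes can be sketched as follows. The temperatures are copied from Table 2.3; the constants and the helper `group_counts` are illustrative assumptions, not part of the original analysis.

```python
from itertools import accumulate

# Daily temperatures (°C) copied from Table 2.3, in day order.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]

LOWEST = 16     # lower bound of the first class
WIDTH = 2       # two-degree classes: 16-17, 18-19, ..., 32-33
N_CLASSES = 9


def group_counts(values):
    """Count how many values fall into each two-degree class."""
    counts = [0] * N_CLASSES
    for value in values:
        counts[(value - LOWEST) // WIDTH] += 1
    return counts


highs = group_counts(max_temps)
lows = group_counts(min_temps)
cum_highs = list(accumulate(highs))
cum_lows = list(accumulate(lows))

for i in range(N_CLASSES):
    lower = LOWEST + i * WIDTH
    print(f"{lower}-{lower + WIDTH - 1}: highs={highs[i]:>2} lows={lows[i]:>2} "
          f"cum. highs={cum_highs[i]:>2} cum. lows={cum_lows[i]:>2}")
```

Using the same class boundaries for both series is what allows the highs and lows to be compared on one graph or in side-by-side graphs.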
There is a second variant of Sturges’ rule for binary data. This variant defines the
number of classes, as shown in equation (2.2):
\[ k = 1 + \log_{2} n \tag{2.2} \]
Table 2.3 Minimum and maximum temperatures for a month (°C)
Day  Maximum temperature  Minimum temperature
Sunday 23 18
Monday 26 19
Tuesday 25 19
Wednesday 27 17
Thursday 32 22
Friday 29 21
Saturday 26 20
Sunday 27 19
Monday 30 22
Tuesday 31 21
Wednesday 33 23
Thursday 24 20
Friday 25 18
Saturday 27 19
Sunday 28 20
Monday 32 22
Tuesday 24 18
Wednesday 26 16
Thursday 25 17
Friday 22 17
Saturday 28 19
Sunday 27 20
Monday 28 20
Tuesday 29 21
Wednesday 28 20
Thursday 26 19
Friday 27 20
Saturday 30 21
Sunday 29 20
Monday 31 23
When n is less than 1,000, the two equations result in approximately the same num-
ber of classes. For example, for 900 cases, this second formula gives k equal to 10.81,
which is the identical result. For the thirty-observation case, the second formula gives
5.91, which is almost identical. It has been pointed out in various places (see Hyndman,
1995) that Sturges’ rule is good only for samples less than 200, and that it is based on a
flawed argument. Nevertheless, it is still the standard used by most statistical software
packages. There are two other rules that may be used, and these are discussed later in
this chapter, because they utilise statistical measures that have not been discussed at
this point. All the rules produce similar results for small samples, but diverge as the
sample size becomes increasingly large. The other possible problems with Sturges’
rule are, first, that it may lead to over-smoothed data and, second, that its requirement
for equal intervals may hide important information.
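A small Python comparison of equations (2.1) and (2.2) may help here. It is a sketch with function names invented for illustration. Since 3.322 is approximately 1/log10(2), the two expressions are nearly the same function of n, which is why the results below agree so closely.

```python
import math


def sturges_log10(n: int) -> float:
    """Equation (2.1): k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def sturges_log2(n: int) -> float:
    """Equation (2.2): k = 1 + log2(n)."""
    return 1 + math.log2(n)


# Sample sizes used in the text, plus two larger ones for comparison.
for n in (30, 900, 10_000, 1_000_000):
    print(f"n = {n:>9,}: eq. (2.1) = {sturges_log10(n):.3f}, "
          f"eq. (2.2) = {sturges_log2(n):.3f}")
```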
Stem and leaf displays  Another way to display discrete data is to use a
stem and leaf display. Essentially, the stem is the most aggregate level of grouping
of the data, while the leaf is made up of more disaggregate data. Table 2.5 shows
some household data when the actual income was collected, rather than having people
respond to pre-defined classes.
A stem and leaf display would be constructed, for example, by using the tens of
thousands of dollars as the stem and the thousands as the leaf. This, like a histogram,
provides a picture of the distribution of the data, as shown in Figure 2.11. This graphic
shows clearly the nature of the distribution of incomes.
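The construction can be sketched in Python as below. The incomes used are hypothetical values invented for this illustration (they are not the data of Table 2.5), and the function name `stem_and_leaf` is likewise an assumption.

```python
from collections import defaultdict


def stem_and_leaf(incomes):
    """Stem = tens of thousands of dollars; leaf = thousands digit."""
    stems = defaultdict(list)
    for income in sorted(incomes):
        thousands = income // 1000          # drop the hundreds, tens, and units
        stems[thousands // 10].append(thousands % 10)
    return dict(stems)


# Hypothetical annual incomes in dollars (illustrative only).
incomes = [8500, 12300, 15750, 21000, 22400, 24990,
           27800, 32100, 34900, 41200, 58000]

display = stem_and_leaf(incomes)

# Print every stem between the smallest and largest, including empty ones,
# so that gaps in the distribution remain visible.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Each row reads like a bar of a histogram turned on its side, with the added benefit that the thousands digit of every observation is still visible.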
Central measures of data
There are at least six different averages that can be computed, which provide different
ways of assessing the central value of the data. The six that are discussed here are:
(1) arithmetic mean;
(2) median;
(3) mode;
(4) geometric mean;
(5) harmonic mean; and
(6) quadratic mean.
Table 2.4 Grouped temperature data
Temperature  Number    Number   Cumulative       Cumulative
range        of highs  of lows  number of highs  number of lows
16–17            0        4           0                4
18–19            0        9           0               13
20–21            0       12           0               25
22–23            2        5           2               30
24–25            5        0           7               30
26–27            9        0          16               30
28–29            7        0          23               30
30–31            4        0          27               30
32–33            3        0          30               30
The arithmetic mean The arithmetic mean is simply the total of all the
values in the data divided by the number of elements in the sample that provided valid
values for the statistic.
Mathematically, it is usually written as equation (2.3):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2.3} \]
In words, the mean of the variable x is equal to the sum of all the values of x in the data
set, divided by the number of observations, n. It is important to note that values of x
that contribute to the estimation of the mean are only those that are valid, and that n is
also a count of the valid observations. Thus, in the income data we used previously, the
missing values would be removed, and a mean, if it was calculated, would be based on
900 observations, not on the 1,070 survey returns.
The sample mean – i.e., the value of the mean estimated from a sample of observa-
tions – is normally denoted by the symbol x̅, while the true mean from the population
is denoted by the Greek letter μ. It is a convention in statistics to use Greek letters to
denote true population values, and the equivalent Roman letter to denote the sample
estimate of that value. Put another way, the parameter is denoted by a Greek letter, and
the statistic by the equivalent Roman letter.
Using the temperature data from Table 2.3, the sum of the maximum temperatures
is found to be 825, which yields an arithmetic mean of 27.5°C. Similarly, the sum of
the minimum temperatures is 591, which gives an arithmetic mean of 19.7°C. In each
of these cases there were thirty valid observations, so the total or sum was divided by
thirty to give the arithmetic mean. Similarly, using the income data from Table 2.5, the
sum of the incomes is $2,248,437. With sixty valid observations of income, the arith-
metic mean of income is $37,474.
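The temperature figures can be verified with a short Python sketch. The data are copied from Table 2.3; the helper `arithmetic_mean` and its use of `None` to mark a missing value are assumptions made for illustration.

```python
# Daily temperatures (°C) copied from Table 2.3, in day order.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]


def arithmetic_mean(values):
    """Equation (2.3), using only valid (non-missing) observations."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


print(f"Sum of maxima = {sum(max_temps)}, mean = {arithmetic_mean(max_temps):.1f}")
print(f"Sum of minima = {sum(min_temps)}, mean = {arithmetic_mean(min_temps):.1f}")
```

Note that both the sum and the count are taken over valid observations only, which is the point made above about the 900 valid income responses.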
The arithmetic mean (usually referred to simply as the mean, because it is the mean
most often used) can also be understood by considering it as being the centre of gravity
of the data. This is shown in Figure 2.12. In each of the two distributions shown in the
figure, the fulcrum or balance point represents the mean. In the distribution on the left
the mean is at thirteen, while in the one on the right it is at fourteen.
Figure 2.12 illustrates two important facts. First, the symmetry or lack of it in a distri-
bution of values will affect where the mean falls. Second, the arithmetic mean is influ-
enced by extreme values. If the value of twenty were removed from the data distribution
on the right of Figure 2.12, the mean would shift to thirteen. On the other hand, if the
extreme value had been at twenty-five instead of twenty, the mean value would shift to
14.5. These changes come about by changing one out of nine observations, suggesting
some substantial sensitivity of the mean to a relatively small change in the data.
The median  The median is the central value of the data, or it can be
defined to be the value for which half the data are above the value and half are below.
For any data, the median value is most easily found by ordering the data in increasing
or decreasing value and then finding the midpoint value. For the temperature data, this
is seen fairly easily in the grouped data of Table 2.4. For the maximum temperature,
the dividing point between the first fifteen values and the last fifteen values is found at
27°C, which is therefore the median value. Similarly, for the minimum temperatures,
the median is 20°C. Note that the median must be a whole number of degrees in these
cases, because the data are reported only in whole numbers of degrees. Note that the
medians of each of these two variables are not exactly the same as the means, although
they are very close.
For already grouped data, the median must be a range. Looking back at the income
data in Table 2.2, and using the cumulative proportions with the missing data excluded,
it can be seen that the median falls in the interval $36,400–$51,999, because the
cumulative proportion passes one half (from 0.4000 to 0.5433) within that class. For the income
data in Table 2.5, the median can be an actual value. However, because there is an even
number of observations, the median actually falls between the thirtieth and the thirty-first
observations, so between $32,568 and $34,288. By interpolation, the median
would be $33,428. Comparing this to the mean, it is noted that the mean is quite a bit
higher at $37,474.
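The ordering-and-midpoint procedure can be sketched in Python as follows. The maximum temperatures are copied from Table 2.3 and the two income values are those quoted above; the function itself is an illustrative assumption about how one might code the procedure.

```python
def median(values):
    """Order the data and take the midpoint value.

    With an even number of observations, interpolate halfway between
    the two central values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Daily maximum temperatures (°C) copied from Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]

print("Median maximum temperature:", median(max_temps))

# The thirtieth and thirty-first ordered incomes quoted in the text.
print("Interpolated median income:", median([32568, 34288]))
```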
The mode  The mode is the most frequently occurring value in a set of
observations. For the maximum temperature data, the mode occurs at 27°C, for which
there are five observations. For the minimum temperature, the mode occurs at 20°C,
for which there are eight days on which this temperature occurs. For the income data
in Table 2.2, the mode is $15,600–$25,999. This is quite different from the median. For
the income data in Table 2.5, there is no mode for the ungrouped data, because each
value is unique. To find a mode, it is necessary to group the data. This has, effectively,
been done in the stem and leaf display, from which it can be determined that the mode
is in the range of $20,000–$24,999, which contains eight households. Using classes
of $5,000 for the ranges, there is no other range that has as many households in it. If
ranges of $10,000 were used, then the mode would be in the range $20,000–$29,999.
[Figure 2.12 Arithmetic mean as centre of gravity. Source: Ewart, Ford, and Lin (1982: 38).]
Unlike all the other mean values, there may be more than one mode. In fact, the limit
on the number of modes that can occur is the number of observations, if each value
occurs only once in the data set. However, this is not a useful result, and data in which
each value occurs only once, as in the income data, should be grouped to provide more
useful information. Data may be distributed bimodally or trimodally, or more. This
means that there will be multiple peaks in the data distribution. Figure 2.13 shows a
possible bimodal distribution of daily maximum temperatures. There are two modes in
the underlying data, one at 23°C and one at 27°C. Knowing that there are two modes
in a data set provides information on the appearance of the underlying distribution, as
shown in Figure 2.13.
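Finding the mode, and detecting more than one mode, can be sketched with a frequency count. The minimum temperatures are copied from Table 2.3; the bimodal list is a hypothetical example invented here, and the helper `modes` is an assumption for illustration.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values (sorted) and their frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


# Daily minimum temperatures (°C) copied from Table 2.3.
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]

# Hypothetical data with two equally frequent values (illustrative only).
bimodal = [23, 23, 23, 25, 27, 27, 27]

print("Minimum temperatures:", modes(min_temps))
print("Hypothetical bimodal data:", modes(bimodal))
# When every value is unique, every value is returned as a mode.
print("All values unique:", modes([1, 2, 3]))
```

The last call illustrates why data in which each value occurs only once should be grouped before a mode is sought.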
The geometric mean The geometric mean is similar to the arithmetic
mean, except that it is determined from the product of all the values, not the sum, and
the nth root of the product is taken, rather than dividing by n. Thus, the geometric mean
is written as shown in equation (2.4):
\[ \bar{x}_g = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \tag{2.4} \]
It is most useful when looking at growth over time periods. For example, suppose an
individual had investments in a mutual fund over a period of twelve years, and the
fund experienced the growth rates shown in Table 2.6. The question one might like to
ask is: ‘What is the average annual growth rate over the twelve years?’ If one were to
estimate this using the arithmetic mean, one would obtain the answer that the average
growth rate is 5.85 per cent. However, the geometric mean produces a value of 5.77
per cent. Although this difference does not appear to be numerically large, it has a sig-
nificant effect on calculations of the value of the investment at the end of twelve years.
If one assumes that the actual initial investment was $10,000, then the actual fund
would stand at $19,609.82. This is exactly the result that would be obtained by using
the geometric mean. However, the arithmetic mean would estimate the fund as being
$19,782.92 – a difference of $173.10.
[Figure 2.13 Bimodal distribution of temperatures. X-axis: maximum daily temperatures; Y-axis: days of occurrence.]
The arithmetic mean is obtained from equation (2.5):
\[ \bar{x} = \frac{1.052 + 1.067 + 1.103 + 1.139 + 1.116 + 1.065 + 1.059 + 1.038 + 1.002 + 1.021 + 1.016 + 1.024}{12} = \frac{12.702}{12} = 1.0585 \tag{2.5} \]
This produces an estimated annual average growth rate of 5.85 per cent. Using this
to estimate the actual value of the fund at the end of twelve years, assuming an initial
investment of $10,000, one would calculate equation (2.6):
\[ V_{12} = \$10{,}000 \times (1.0585)^{12} = \$10{,}000 \times 1.978292 = \$19{,}782.92 \tag{2.6} \]
The geometric mean is obtained from equation (2.7):
\[ \bar{x}_g = (1.052 \times 1.067 \times 1.103 \times 1.139 \times 1.116 \times 1.065 \times 1.059 \times 1.038 \times 1.002 \times 1.021 \times 1.016 \times 1.024)^{1/12} = (1.96098)^{1/12} = 1.0577 \tag{2.7} \]
This produces the estimated annual geometric mean growth rate of 5.77 per cent. To
estimate the value of the fund at the end of twelve years, one estimates in the same
manner as for the arithmetic mean, as in equation (2.8):
\[ V_{12} = \$10{,}000 \times (1.0577)^{12} = \$10{,}000 \times 1.960982 = \$19{,}609.82 \tag{2.8} \]
The reader can readily verify that this is identical to the amount calculated by apply-
ing each year’s growth rate, compounded, to the amount of the fund at the end of the
previous year. Note that both the arithmetic and geometric means are obtained by using
the compounding formula (1 + growth rate) to obtain the average rate of growth.
Table 2.6 Growth rates of an investment fund, 1993–2004
Year  Growth (percentage change)
1993   5.20
1994   6.70
1995  10.30
1996  13.90
1997  11.60
1998   6.50
1999   5.90
2000   3.80
2001   0.20
2002   2.10
2003   1.60
2004   2.40
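The comparison between the two means can be reproduced with a brief Python sketch. The growth rates are copied from Table 2.6 and the initial investment of $10,000 is the one assumed in the text; the variable names are illustrative.

```python
import math

# Annual growth rates (per cent) copied from Table 2.6, 1993-2004.
growth_pct = [5.2, 6.7, 10.3, 13.9, 11.6, 6.5, 5.9, 3.8, 0.2, 2.1, 1.6, 2.4]

# Compounding factors (1 + growth rate).
factors = [1 + g / 100 for g in growth_pct]
n = len(factors)

arithmetic = sum(factors) / n             # equation (2.5)
geometric = math.prod(factors) ** (1 / n)  # equation (2.7)

initial = 10_000
actual = initial * math.prod(factors)      # compound each year's growth in turn
by_arithmetic = initial * arithmetic ** n  # equation (2.6)
by_geometric = initial * geometric ** n    # equation (2.8), unrounded mean

print(f"Arithmetic mean growth: {100 * (arithmetic - 1):.2f} per cent")
print(f"Geometric mean growth:  {100 * (geometric - 1):.2f} per cent")
print(f"Actual fund value:      ${actual:,.2f}")
print(f"Using arithmetic mean:  ${by_arithmetic:,.2f}")
print(f"Using geometric mean:   ${by_geometric:,.2f}")
```

Because the geometric mean is the twelfth root of the product of the factors, raising it to the twelfth power recovers the actual compounded value, which the arithmetic mean cannot do.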
The harmonic mean The harmonic mean is obtained by summing the
inverse of the values for each observation, taking the inverse of this value, and multi-
plying the result by the number of observations. It may be written as shown in equa-
tion (2.9):
\[ \bar{x}_h = \frac{n}{\sum_{i=1}^{n} 1/x_i} \tag{2.9} \]
Table 2.7 Speeds by kilometre for a train
Kilometre of trip  Speed (km/h)  Time taken (minutes)
 1                    40          1.5
 2                    45          1.333
 3                    55          1.091
 4                    60          1
 5                    70          0.857
 6                    65          0.923
 7                    50          1.2
 8                    35          1.714
 9                    40          1.5
10                    60          1
11                    70          0.857
12                    80          0.75
13                   100          0.6
14                    90          0.667
15                    70          0.857
16                    60          1
17                    60          1
18                    45          1.333
19                    40          1.5
20                    30          2
Total                  –         22.683
The harmonic mean is used to estimate a mean from rates such as rates by time or
distance. A good example would be provided by estimating the average speed of a
train when the train’s speed changes every one kilometre, because of track condition,
signals, and congestion. Suppose that the speeds for each kilometre of a twenty-kilometre
train trip were as shown in Table 2.7. If one were to take the arithmetic mean,
this would give a mean speed of 58.25 kilometres per hour (km/h). This would suggest
that the time taken for the trip was 20 × 60 / 58.25 minutes, or 20.6 minutes, when it
was actually 22.7 minutes (see Table 2.7). The harmonic mean is calculated as shown
in equation (2.10):
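\[ \bar{x}_h = \frac{20}{\frac{1}{40} + \frac{1}{45} + \frac{1}{55} + \frac{1}{60} + \frac{1}{70} + \frac{1}{65} + \frac{1}{50} + \frac{1}{35} + \frac{1}{40} + \frac{1}{60} + \frac{1}{70} + \frac{1}{80} + \frac{1}{100} + \frac{1}{90} + \frac{1}{70} + \frac{1}{60} + \frac{1}{60} + \frac{1}{45} + \frac{1}{40} + \frac{1}{30}} = \frac{20}{0.37805} = 52.903 \tag{2.10} \]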
This gives a harmonic mean speed of 52.903 km/h. Using this figure, rather than the
arithmetic mean speed, the time taken for the twenty kilometre trip is 20 × 60 / 52.903
minutes, or 22.7 minutes, which is the correct figure.
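The train example can be checked with the following Python sketch. The speeds are copied from Table 2.7; the variable names are assumptions made for illustration.

```python
# Speed (km/h) over each of the twenty one-kilometre segments, from Table 2.7.
speeds = [40, 45, 55, 60, 70, 65, 50, 35, 40, 60,
          70, 80, 100, 90, 70, 60, 60, 45, 40, 30]
n = len(speeds)

arithmetic = sum(speeds) / n
harmonic = n / sum(1 / s for s in speeds)   # equation (2.9)

# Actual travel time: each kilometre takes 60 / speed minutes.
actual_minutes = sum(60 / s for s in speeds)

print(f"Arithmetic mean speed: {arithmetic:.2f} km/h "
      f"-> {n * 60 / arithmetic:.1f} minutes")
print(f"Harmonic mean speed:   {harmonic:.3f} km/h "
      f"-> {n * 60 / harmonic:.1f} minutes")
print(f"Actual time taken:     {actual_minutes:.3f} minutes")
```

The harmonic mean recovers the true trip time because each segment covers the same distance, so the segments should be weighted by the time they take rather than counted equally.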
The quadratic mean The quadratic mean is also known as the root mean
square (RMS). It is given by summing the squared values of the observations, dividing
these by the number of observations, and taking the square root of the result, as shown
in equation (2.11):
\[ \mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} x_i^{2}}{n}} \tag{2.11} \]
The quadratic mean is most often used with data whose arithmetic mean is zero. It
is often used for estimating error when the expected value of the average error is zero.
For example, suppose that one is assessing the accuracy of a machine that produces
ball bearings of nominally 100 millimetres (mm) in diameter. Measurements are taken
of a number of ball bearings, and the actual diameters found to be those shown in
Table 2.8, which also shows how much each one deviates from 100 mm.
The arithmetic mean of the deviations is −0.11 mm. However, the RMS is ±0.81
mm. The latter value gives a much clearer idea of the amount by which the ball bear-
ings actually deviate from the desired diameter, because it does not allow the negative
values to compensate for the positive ones. It shows, more precisely, the tolerance in
the manufacturing process.
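A minimal Python sketch of this calculation follows. The diameters are copied from Table 2.8 and the nominal diameter of 100 mm is taken from the text; the variable names are illustrative.

```python
import math

NOMINAL_MM = 100.0

# Measured diameters (mm) copied from Table 2.8.
diameters = [98.5, 100.2, 99.6, 98.9, 100.6, 100.3, 100.7, 99.1, 99.9, 101.1]

# Deviation of each ball bearing from the nominal diameter.
deviations = [d - NOMINAL_MM for d in diameters]
n = len(deviations)

mean_deviation = sum(deviations) / n
rms = math.sqrt(sum(x ** 2 for x in deviations) / n)   # equation (2.11)

print(f"Arithmetic mean of deviations: {mean_deviation:.2f} mm")
print(f"Root mean square deviation:    {rms:.2f} mm")
```

Squaring removes the signs before averaging, so positive and negative deviations cannot cancel as they do in the arithmetic mean.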
Relationships between mean (arithmetic), median, and mode There
are relationships between the arithmetic mean (referred to hereafter as the mean), the
median, and the mode that can tell us more about the underlying data. In the tempera-
ture data from Table 2.3, it was found that the mean high temperature was 27.5°C, the
median was 27°C, and the mode occurred at 27°C. In this case, it can be seen that the
mode, median, and mean are all quite close. For the low temperatures, the mean was
19.7°C, the median was 20°C, and the mode was also 20°C. Again, the values are very
similar. In contrast, for the income data of Table 2.5, the mean is $37,474, the median
is $33,428, and the mode would be in the range $20,000–$24,999. These values are
not particularly close.
For the mean, mode, and median to be the same value, the data must be distributed
symmetrically around the mean and median, and the distribution must be unimodal –
i.e., have one mode – which must occur at the mean value.
Plotting the temperature data, as shown in Figure 2.14 for the high temperatures
and Figure 2.15 for the low temperatures, shows distributions that are very nearly
symmetrical and that meet the conditions for a coincidence of mean, mode, and
median.
Using Sturges’ rule, with sixty observations on income, incomes should be grouped
into seven equal steps. This can be done by setting the intervals to $15,000. The result
is shown in Figure 2.16. In contrast to the temperature data, Figure 2.16 shows that
the data are not symmetrical but, rather, that they are skewed to the right, meaning
that there is a longer tail to the distribution to the right than to the left. This leads to a
median and a mode that are both below the mean.
Table 2.8 Measurements of ball bearings
Ball bearing  Diameter (mm)  Deviation (mm)
 1                 98.5          −1.5
 2                100.2           0.2
 3                 99.6          −0.4
 4                 98.9          −1.1
 5                100.6           0.6
 6                100.3           0.3
 7                100.7           0.7
 8                 99.1          −0.9
 9                 99.9          −0.1
10                101.1           1.1
[Figure 2.14 Distribution of maximum temperatures from Table 2.4. X-axis: maximum temperature; Y-axis: frequency.]