Collecting, Managing, and Assessing Data
Using Sample Surveys
Collecting, Managing, and Assessing Data Using Sample Surveys
provides a thorough, step-by-step guide to the design and imple-
mentation of surveys. Beginning with a primer on basic statistics,
the first half of the book takes readers on a comprehensive tour
through the basics of survey design. Topics covered include the
ethics of surveys, the design of survey procedures, the design of
the survey instrument, how to write questions, and how to draw
representative samples. Having shown readers how to design sur-
veys, the second half of the book discusses a number of issues sur-
rounding their implementation, including repetitive surveys, the
economics of surveys, Web-based surveys, coding and data entry,
data expansion and weighting, the issue of nonresponse, and the
documenting and archiving of survey data. The book is an excel-
lent introduction to the use of surveys for graduate students as well
as a useful reference work for scholars and professionals.
Peter Stopher is Professor of Transport Planning at the
Institute of Transport and Logistics Studies at the University of
Sydney. He has also been a professor at Northwestern University,
Cornell University, McMaster University, and Louisiana State
University. Professor Stopher has developed a substantial reputa-
tion in the field of data collection, particularly for the support of
travel forecasting and analysis. He pioneered the development of
travel and activity diaries as a data collection mechanism, and has
written extensively on issues of sample design, data expansion,
nonresponse biases, and measurement issues.

Collecting, Managing, and
Assessing Data Using
Sample Surveys
Peter Stopher
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Tokyo, Mexico City
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521681872
© Peter Stopher 2012
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2012
Printed in the United Kingdom at the University Press, Cambridge
A catalogue record for this publication is available from the British Library
ISBN 978-0-521-86311-7 Hardback
ISBN 978-0-521-68187-2 Paperback
Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to
in this publication, and does not guarantee that any content on such
websites is, or will remain, accurate or appropriate.

To my wife, Carmen, with grateful thanks for your faith in me and your continuing
support and encouragement.
Contents
List of figures	 page xix
List of tables	 xxii
Acknowledgements	 xxv
	 1	 Introduction	 1
1.1	 The purpose of this book	 1
1.2	 Scope of the book	 2
1.3	 Survey statistics	 4
	 2	 Basic statistics and probability	 6
2.1	 Some definitions in statistics	 6
2.1.1	 Censuses and surveys	 7
2.2	 Describing data	 8
2.2.1	 Types of scales	 8
Nominal scales	 8
Ordinal scales	 9
Interval scales	 9
Ratio scales	 10
Measurement scales	 10
2.2.2	 Data presentation: graphics	 11
2.2.3	 Data presentation: non-graphical	 16
Measures of magnitude	 17
Frequencies and proportions	 17
Central measures of data	 21
Measures of dispersion	 34
The normal distribution	 45
Some useful properties of variances and standard deviations	 46
Proportions or probabilities	 47
Data transformations	 48
Covariance and correlation	 50
Coefficient of variation	 51
Other measures of variability	 53
Alternatives to Sturges’ rule	 62
	 3	 Basic issues in surveys	 64
3.1	 Need for survey methods	 64
3.1.1	 A definition of sampling methodology	 65
3.2	 Surveys and censuses	 65
3.2.1	 Costs	 66
3.2.2	 Time	 67
3.3	 Representativeness	 68
3.3.1	 Randomness	 69
3.3.2	 Probability sampling	 70
Sources of random numbers	 71
3.4	 Errors and bias	 71
3.4.1	 Sample design and sampling error	 73
3.4.2	 Bias	 74
3.4.3	 Avoiding bias	 78
3.5	 Some important definitions	 78
	 4	 Ethics of surveys of human populations	 81
4.1	 Why ethics?	 81
4.2	 Codes of ethics or practice	 82
4.3	 Potential threats to confidentiality	 84
4.3.1	 Retaining detail and confidentiality	 85
4.4	 Informed consent	 86
4.5	 Conclusions	 89
	 5	 Designing a survey	 91
5.1	 Components of survey design	 91
5.2	 Defining the survey purpose	 93
5.2.1	 Components of survey purpose	 94
Data needs	 94
Comparability or innovation	 97
Defining data needs	 99
Data needs in human subject surveys	 99
Survey timing	 100
Geographic bounds for the survey	 101
5.3	 Trade-offs in survey design	 102
	 6	 Methods for conducting surveys of human populations	 104
6.1	 Overview	 104
6.2	 Face-to-face interviews	 105
6.3	 Postal surveys	 107
6.4	 Telephone surveys	 108
6.5	 Internet surveys	 111
6.6	 Compound survey methods	 112
6.6.1	 Pre-recruitment contact	 112
6.6.2	 Recruitment	 113
Random digit dialling	 115
6.6.3	 Survey delivery	 117
6.6.4	 Data collection	 118
6.6.5	 An example	 119
6.7	 Mixed-mode surveys	 120
6.7.1	 Increasing response and reducing bias	 123
6.8	 Observational surveys	 125
	 7	 Focus groups	 127
7.1	 Introduction	 127
7.2	 Definition of a focus group	 128
7.2.1	 The size and number of focus groups	 128
7.2.2	 How a focus group functions	 129
7.2.3	 Analysing the focus group discussions	 131
7.2.4	 Some disadvantages of focus groups	 131
7.3	 Using focus groups to design a survey	 132
7.4	 Using focus groups to evaluate a survey	 134
7.5	 Summary	 135
	 8	 Design of survey instruments	 137
8.1	 Scope of this chapter	 137
8.2	 Question type	 137
8.2.1	 Classification and behaviour questions	 138
Mitigating threatening questions	 139
8.2.2	 Memory or recall error	 142
8.3	 Question format	 145
8.3.1	 Open questions	 145
8.3.2	 Field-coded questions	 146
8.3.3	 Closed questions	 147
8.4	 Physical layout of the survey instrument	 150
8.4.1	 Introduction	 150
8.4.2	 Question ordering	 153
Opening questions	 153
Body of the survey	 154
The end of the questionnaire	 158
8.4.3	 Some general issues on question layout	 159
Overall format	 160
Appearance of the survey	 161
Front cover	 162
Spatial layout	 163
Choice of typeface	 164
Use of colour and graphics	 166
Question numbering	 169
Page breaks	 170
Repeated questions	 171
Instructions	 172
Show cards	 174
Time of the interview	 174
Precoding	 174
End of the survey	 175
Some final comments on questionnaire layout	 176
	 9	 Design of questions and question wording	 177
9.1	 Introduction	 177
9.2	 Issues in writing questions	 178
9.2.1	 Requiring an answer	 178
9.2.2	 Ready answers	 180
9.2.3	 Accurate recall and reporting	 181
9.2.4	 Revealing the data	 182
9.2.5	 Motivation to answer	 183
9.2.6	 Influences on response categories	 184
9.2.7	 Use of categories and other responses	 185
Ordered and unordered categories	 187
9.3	 Principles for writing questions	 188
9.3.1	 Use simple language	 189
9.3.2	 Number of words	 190
9.3.3	 Avoid using vague words	 191
9.3.4	 Avoid using ‘Tick all that apply’ formats	 193
9.3.5	 Develop response categories that are mutually exclusive and exhaustive	 193
9.3.6	 Make sure that questions are technically correct	 195
9.3.7	 Do not ask respondents to say ‘Yes’ in order to say ‘No’	 196
9.3.8	 Avoid double-barrelled questions	 196
9.4	 Conclusion	 197
	 10	 Special issues for qualitative and preference surveys	 199
10.1	 Introduction	 199
10.2	 Designing qualitative questions	 199
10.2.1	 Scaling questions	 200
10.3	 Stated response questions	 206
10.3.1	 The hypothetical situation	 206
10.3.2	 Determining attribute levels	 207
10.3.3	 Number of choice alternatives or scenarios	 207
10.3.4	 Other issues of concern	 208
Data inconsistency	 208
Lexicographic responses	 209
Random responses	 209
10.4	 Some concluding comments on stated response survey design	 210
	 11	 Design of data collection procedures	 211
11.1	 Introduction	 211
11.2	 Contacting respondents	 211
11.2.1	 Pre-notification contacts	 211
11.2.2	 Number and type of contacts	 213
Nature of reminder contacts	 213
Postal surveys	 215
Postal surveys with telephone recruitment	 216
Telephone interviews	 217
Face-to-face interviews	 219
Internet surveys	 220
11.3	 Who should respond to the survey?	 221
11.3.1	 Targeted person	 221
11.3.2	 Full household surveys	 223
Proxy reporting	 224
11.4	 Defining a complete response	 225
11.4.1	 Completeness of the data items	 226
11.4.2	 Completeness of aggregate sampling units	 228
11.5	 Sample replacement	 229
11.5.1	 When to replace a sample unit	 229
11.5.2	 How to replace a sample	 233
11.6	 Incentives	 235
11.6.1	 Recommendations on incentives	 236
11.7	 Respondent burden	 240
11.7.1	 Past experience	 241
11.7.2	 Appropriate moment	 242
11.7.3	 Perceived relevance	 242
11.7.4	 Difficulty	 243
Physical difficulty	 243
Intellectual difficulty	 244
Emotional difficulty	 245
Reducing difficulty	 246
11.7.5	 External factors	 246
Attitudes and opinions of others	 246
The ‘feel good’ effect	 247
Appropriateness of the medium	 248
11.7.6	 Mitigating respondent burden	 248
11.8	 Concluding comments	 250
	 12	 Pilot surveys and pretests	 251
12.1	 Introduction	 251
12.2	 Definitions	 252
12.3	 Selecting respondents for pretests and pilot surveys	 255
12.3.1	 Selecting respondents	 255
12.3.2	 Sample size	 258
Pilot surveys	 258
Pretests	 261
12.4	 Costs and time requirements of pretests and pilot surveys	 262
12.5	 Concluding comments	 264
	 13	 Sample design and sampling	 265
13.1	 Introduction	 265
13.2	 Sampling frames	 266
13.3	 Random sampling procedures	 268
13.3.1	 Initial considerations	 268
13.3.2	 The normal law of error	 269
13.4	 Random sampling methods	 270
13.4.1	 Simple random sampling	 271
Drawing the sample	 271
Estimating population statistics and sampling errors	 273
Example	 276
Sampling from a finite population	 279
Sampling error of ratios and proportions	 279
Defining the sample size	 281
Examples	 283
13.4.2	 Stratified sampling	 285
Types of stratified samples	 285
Study domains and strata	 287
Weighted means and variances	 287
Stratified sampling with a uniform sampling fraction	 289
Drawing the sample	 289
Estimating population statistics and sampling errors	 290
Pre- and post-stratification	 291
Example	 293
Equal allocation	 294
Summary of proportionate sampling	 295
Stratified sampling with variable sampling fraction	 295
Drawing the sample	 295
Estimating population statistics and sampling errors	 296
Non-coincident study domains and strata	 296
Optimum allocation and economic design	 297
Example	 298
Survey costs differing by stratum	 300
Example	 301
Practical issues in drawing disproportionate samples	 303
Concluding comments on disproportionate sampling	 305
13.4.3	 Multistage sampling	 305
Drawing a multistage sample	 306
Requirements for multistage sampling	 307
Estimating population values and sampling statistics	 308
Example	 309
Concluding comments on multistage sampling	 314
13.5	 Quasi-random sampling methods	 314
13.5.1	 Cluster sampling	 316
Equal clusters: population values and standard errors	 317
Example	 319
The effects of clustering	 321
Unequal clusters: population values and standard errors	 322
Random selection of unequal clusters	 324
Example	 325
Stratified sampling of unequal clusters	 326
Paired selection of unequal-sized clusters	 327
13.5.2	 Systematic sampling	 328
Population values and standard errors in a systematic sample	 328
Simple random model	 329
Stratified random model	 329
Paired selection model	 329
Successive difference model	 330
Example	 330
13.5.3	 Choice-based sampling	 333
13.6	 Non-random sampling methods	 334
13.6.1	 Quota sampling	 334
13.6.2	 Intentional, judgemental, or expert samples	 335
13.6.3	 Haphazard samples	 335
13.6.4	 Convenience samples	 336
13.7	 Summary	 336
	 14	 Repetitive surveys	 337
14.1	 Introduction	 337
14.2	 Non-overlapping samples	 338
14.3	 Incomplete overlap	 339
14.4	 Subsampling on the second and subsequent occasions	 341
14.5	 Complete overlap: a panel	 342
14.6	 Practical issues in designing and conducting panel surveys	 343
14.6.1	 Attrition	 344
Replacement of panel members lost by attrition	 345
Reducing losses due to attrition	 346
14.6.2	 Contamination	 347
14.6.3	 Conditioning	 348
14.7	 Advantages and disadvantages of panels	 348
14.8	 Methods for administering practical panel surveys	 349
14.9	 Continuous surveys	 352
	 15	 Survey economics	 356
15.1	 Introduction	 356
15.2	 Cost elements in survey design	 357
15.3	 Trade-offs in survey design	 359
15.3.1	 Postal surveys	 360
15.3.2	 Telephone recruitment with a postal survey with or without telephone retrieval	 361
15.3.3	 Face-to-face interview	 362
15.3.4	 More on potential trade-offs	 362
15.4	 Concluding comments	 363
	 16	 Survey implementation	 365
16.1	 Introduction	 365
16.2	 Interviewer selection and training	 365
16.2.1	 Interviewer selection	 365
16.2.2	 Interviewer training	 368
16.2.3	 Interviewer monitoring	 369
16.3	 Record keeping	 370
16.4	 Survey supervision	 372
16.5	 Survey publicity	 373
16.5.1	 Frequently asked questions, fact sheet, or brochure	 374
16.6	 Storage of survey forms	 374
16.6.1	 Identification numbers	 375
16.7	 Issues for surveys using posted materials	 377
16.8	 Issues for surveys using telephone contact	 377
16.8.1	 Caller ID	 378
16.8.2	 Answering machines	 378
16.8.3	 Repeated requests for callback	 380
16.9	 Data on incomplete responses	 381
16.10	 Checking survey responses	 382
16.11	 Times to avoid data collection	 383
16.12	 Summary comments on survey implementation	 383
	 17	 Web-based surveys	 385
17.1	 Introduction	 385
17.2	 The internet as an optional response mechanism	 388
17.3	 Some design issues for Web surveys	 389
17.3.1	 Differences between paper and internet surveys	 389
17.3.2	 Question and response	 390
17.3.3	 Ability to fill in the Web survey in multiple sittings	 392
17.3.4	 Progress tracking	 393
17.3.5	 Pre-filled responses	 394
17.3.6	 Confidentiality in Web-based surveys	 395
17.3.7	 Pictures, maps, etc. on Web surveys	 395
Animation in survey pictures and maps	 396
17.3.8	 Browser software	 396
User interface design	 396
Creating mock-ups	 397
Page loading time	 398
17.4	 Some design principles for Web surveys	 398
17.5	 Concluding comments	 399
	 18	 Coding and data entry	 401
18.1	 Introduction	 401
18.2	 Coding	 402
18.2.1	 Coding of missing values	 402
18.2.2	 Use of zeros and blanks in coding	 403
18.2.3	 Coding consistency	 404
Binary variables	 404
Numeric variables	 404
18.2.4	 Coding complex variables	 405
18.2.5	 Geocoding	 406
Requesting address details for other places than home	 408
Pre-coding of buildings	 409
Interactive gazetteers	 410
Other forms of geocoding assistance	 410
Locating by mapping software	 411
18.2.6	 Methods for creating codes	 412
18.3	 Data entry	 413
18.4	 Data repair	 416
	 19	 Data expansion and weighting	 418
19.1	 Introduction	 418
19.2	 Data expansion	 419
19.2.1	 Simple random sampling	 419
19.2.2	 Stratified sampling	 419
19.2.3	 Multistage sampling	 420
19.2.4	 Cluster samples	 420
19.2.5	 Other sampling methods	 421
19.3	 Data weighting	 421
19.3.1	 Weighting with unknown population totals	 422
An example	 423
A second example	 424
19.3.2	 Weighting with known populations	 426
An example	 427
19.4	 Summary	 429
	 20	 Nonresponse	 431
20.1	 Introduction	 431
20.2	 Unit nonresponse	 432
20.2.1	 Calculating response rates	 432
Classifying responses to a survey	 433
Calculating response rates	 435
20.2.2	 Reducing nonresponse and increasing response rates	 440
Design issues affecting nonresponse	 440
Survey publicity	 442
Use of incentives	 442
Use of reminders and repeat contacts	 443
Personalisation	 444
Summary	 445
20.2.3	 Nonresponse surveys	 445
20.3	 Item nonresponse	 450
20.3.1	 Data repair	 450
Flagging repaired variables	 451
Inference	 452
Imputation	 452
Historical imputation	 453
Average imputation	 454
Ratio imputation	 454
Regression imputation	 455
Cold-deck imputation	 456
Hot-deck imputation	 457
Expectation maximisation	 457
Multiple imputation	 458
Imputation using neural networks	 458
Summary of imputation methods	 460
20.3.2	 A final note on item nonresponse	 460
Strategies to obtain age and income	 461
Age	 461
Income	 462
	 21	 Measuring data quality	 464
21.1	 Introduction	 464
21.2	 General measures of data quality	 464
21.2.1	 Missing value statistic	 465
21.2.2	 Data cleaning statistic	 466
21.2.3	 Coverage error	 467
21.2.4	 Sample bias	 468
21.3	 Specific measures of data quality	 469
21.3.1	 Non-mobility rates	 469
21.3.2	 Trip rates and activity rates	 470
21.3.3	 Proxy reporting	 471
21.4	 Validation surveys	 472
21.4.1	 Follow-up questions	 473
21.4.2	 Independent measurement	 475
21.5	 Adherence to quality measures and guidance	 476
	 22	 Future directions in survey procedures	 478
22.1	 Dangers of forecasting new directions	 478
22.2	 Some current issues	 478
22.2.1	 Reliance on telephones	 478
Threats to the use of telephone surveys	 479
Conclusions on reliance on telephones	 481
22.2.2	 Language and literacy	 481
Language	 481
Literacy	 483
22.2.3	 Mixed-mode surveys	 486
22.2.4	 Use of administrative data	 487
22.2.5	 Proxy reporting	 488
22.3	 Some possible future directions	 489
22.3.1	 A GPS survey as a potential substitute for a household travel survey	 493
The effect of multiple observations of each respondent on sample size	 495
	 23	 Documenting and archiving	 499
23.1	 Introduction	 499
23.2	 Documentation or the creation of metadata	 499
23.2.1	 Descriptive metadata	 500
23.2.2	 Preservation metadata	 503
23.2.3	 Geospatial metadata	 503
23.3	 Archiving of data	 506
References	 511
Index	 525
Figures
	 2.1	 Scatter plot of odometer reading versus model year	 page 12
	 2.2	 Scatter plot of fuel type by body type	 12
	 2.3	 Pie chart of vehicle body types	 13
	 2.4	 Pie chart of household income groups	 13
	 2.5	 Histogram of household income	 14
	 2.6	 Histogram of vehicle types	 14
	 2.7	 Line graph of maximum and minimum temperatures for thirty days	 15
	 2.8	 Ogive of cumulative household income data from Figure 2.5	 16
	 2.9	 Relative ogive of household income	 16
	 2.10	 Relative step chart of household income	 17
	 2.11	 Stem and leaf display of income	 22
	 2.12	 Arithmetic mean as centre of gravity	 24
	 2.13	 Bimodal distribution of temperatures	 25
	 2.14	 Distribution of maximum temperatures from Table 2.4	 29
	 2.15	 Distribution of minimum temperatures from Table 2.4	 30
	 2.16	 Income distribution from Table 2.5	 30
	 2.17	 Distribution of vehicle counts	 33
	 2.18	 Box and whisker plot of income data from Table 2.5	 36
	 2.19	 Box and whisker plot of maximum temperatures	 37
	 2.20	 Box and whisker plot of minimum temperatures	 37
	 2.21	 Box and whisker plot of vehicles passing through the green phase	 43
	 2.22	 Box and whisker plot of children’s ages	 45
	 2.23	 The normal distribution	 45
	 2.24	 Comparison of normal distributions with different variances	 46
	 2.25	 Scatter plot of maximum versus minimum temperature	 52
	 2.26	 A distribution skewed to the right	 54
	 2.27	 A distribution skewed to the left	 54
	 2.28	 Distribution with low kurtosis	 55
	 2.29	 Distribution with high kurtosis	 55
	 3.1	 Extract of random numbers from the RAND Million Random Digits	 72
	 4.1	 Example of a consent form	 87
	 4.2	 First page of an example subject information sheet	 88
	 4.3	 Second page of the example subject information sheet	 89
	 5.1	 Schematic of the survey process	 92
	 5.2	 Survey design trade-offs	 103
	 6.1	 Schematic of survey methods	 113
	 8.1	 Document file layout for booklet printing	 162
	 8.2	 Example of an unacceptable questionnaire format	 164
	 8.3	 Example of an acceptable questionnaire format	 165
	 8.4	 Excerpt from a survey showing arrows to guide respondent	 168
	 8.5	 Extract from a questionnaire showing use of graphics	 169
	 8.6	 Columned layout for asking identical questions about multiple people	 171
	 8.7	 Inefficient and efficient structures for organising serial questions	 172
	 8.8	 Instructions placed at the point to which they refer	 173
	 8.9	 Example of an unacceptable questionnaire format with response codes	 175
	 9.1	 Example of a sequence of questions that do not require answers	 178
	 9.2	 Example of a sequence of questions that do require answers	 179
	 9.3	 Example of a belief question	 181
	 9.4	 Example of a belief question with a more vague response	 181
	 9.5	 Two alternative response category sets for the age question	 185
	 9.6	 Alternative questions on age	 186
	 9.7	 Examples of questions with unordered response categories	 187
	 9.8	 An example of mixed ordered and unordered categories	 188
	 9.9	 Reformulated question from Figure 9.8	 189
	 9.10	 An unordered alternative to the question in Figure 9.8	 189
	 9.11	 Avoiding vague words in question wording	 192
	 9.12	 Example of a failure to achieve mutual exclusivity and exhaustiveness	 194
	 9.13	 Correction to mutual exclusivity and exhaustiveness	 195
	 9.14	 Example of a double negative	 196
	 9.15	 Example of removal of a double negative	 196
	 9.16	 An alternative that keeps the wording of the measure	 197
	 9.17	 An alternative way to deal with a double-barrelled question	 197
	 10.1	 Example of a qualitative question	 200
	 10.2	 Example of a qualitative question using number categories	 200
	 10.3	 Example of unbalanced positive and negative categories	 201
	 10.4	 Example of balanced positive and negative categories	 201
	 10.5	 Example of placing the neutral option at the end	 202
	 10.6	 Example of distinguishing the neutral option from ‘No opinion’	 202
	 10.7	 Use of columned layout for repeated category responses	 203
	 10.8	 Alternative layout for repeated category responses	 204
	 10.9	 Statements that call for similar responses	 204
	 10.10	 Statements that call for varying responses	 205
	 10.11	 Rephrasing questions to remove requirement for ‘Agree’/‘Disagree’	 206
	 11.1	 Example of a postcard reminder for the first reminder	 215
	 11.2	 Framework for understanding respondent burden	 241
	 14.1	 Schematic of the four types of repetitive samples	 338
	 14.2	 Rotating panel showing recruitment, attrition, and rotation	 353
	 18.1	 An unordered set of responses requiring coding	 402
	 18.2	 A possible format for asking for an address	 409
	 18.3	 Excerpt from a mark-sensing survey	 415
	 20.1	 Illustration of the categorisation of response outcomes	 436
	 20.2	 Representation of a neural network model	 459
	 23.1	 Open archival information system model	 508
Tables
	 2.1	 Frequencies and proportions of vehicle types	 page 18
	 2.2	 Frequencies, proportions, and cumulative values for household income	 19
	 2.3	 Minimum and maximum temperatures for a month (°C)	 20
	 2.4	 Grouped temperature data	 21
	 2.5	 Disaggregate household income data	 22
	 2.6	 Growth rates of an investment fund, 1993–2004	 26
	 2.7	 Speeds by kilometre for a train	 27
	 2.8	 Measurements of ball bearings	 29
	 2.9	 Number of vehicles passing through the green phase of a traffic light	 32
	 2.10	 Sorted number of vehicles passing through the green phase	 32
	 2.11	 Number of children by age	 34
	 2.12	 Deviations from the mean for the income data of Table 2.5	 38
	 2.13	 Outcomes from throwing the die twice	 40
	 2.14	 Sorted number of vehicles passing through the green phase	 43
	 2.15	 Deviations for vehicles passing through the green phase	 44
	 2.16	 Values of variance and standard deviation for values of p and q	 47
	 2.17	 Deviations for vehicles passing through the green phase raised to third and fourth powers	 57
	 2.18	 Deviations from the mean for children’s ages	 58
	 2.19	 Data on household size, annual income, and number of vehicles for forty households	 59
	 2.20	 Deviations needed for covariance and correlation estimates	 61
	 3.1	 Heights of 100 (fictitious) university students (cm)	 76
	 3.2	 Sample of the first and last five students	 76
	 3.3	 Sample of the first ten students	 76
	 3.4	 Intentional sample of ten students	 77
	 3.5	 Random sample of ten students (in order drawn)	 77
	 3.6	 Summary of results from Tables 3.2 to 3.5	 77
	 6.1	 Internet world usage statistics	 112
	 6.2	 Mixed-mode survey types (based on Dillman and Tarnai, 1991)	 121
	 11.1	 Selection grid by age and gender	 222
	 13.1	 Partial listing of households for a simple random sample	 272
	 13.2	 Excerpt of random numbers from the RAND Million Random Digits	 273
	 13.3	 Selection of sample of 100 members using four-digit groups from Table 13.2	 274
	 13.4	 Data from twenty respondents in a fictitious survey	 276
	 13.5	 Sums of squares for population groups	 286
	 13.6	 Data for drawing an optimum household travel survey sample	 299
	 13.7	 Optimal allocation of the 2,000-household sample	 299
	 13.8	 Optimal allocation and expected sampling errors by stratum	 300
	 13.9	 Results of equal allocation for the household travel survey	 300
	 13.10	 Given information for economic design of the optimal allocation	 301
	 13.11	 Preliminary sample sizes and costs for economic design of the optimum allocation	 301
	 13.12	 Estimation of the final sample size and budget	 302
	 13.13	 Comparison of optimal allocation, equal allocation, and economic design for $150,000 survey	 302
	 13.14	 Comparison of sampling errors from the three sample designs	 303
	 13.15	 Desired stratum sample sizes and results of recruitment calls	 305
	 13.16	 Distribution of departments and students	 310
	 13.17	 Two-stage sample of students from the university	 311
	 13.18	 Multistage sample using disproportionate sampling at the first stage	 313
	 13.19	 Calculations for standard error from sample in Table 13.18	 315
	 13.20	 Examples of cluster samples	 316
	 13.21	 Cluster sample of doctor’s files	 320
	 13.22	 Random drawing of blocks of dwelling units	 326
	 13.23	 Calculations for paired selections and successive differences	 332
	 18.1	 Potential complex codes for income categories	 406
	 18.2	 Example codes for use of the internet and mobile phones	 407
	 19.1	 Results of an hypothetical household survey	 424
	 19.2	 Calculation of weights for the hypothetical household survey	 424
	 19.3	 Two-way distribution of completed surveys	 424
	 19.4	 Two-way distribution of terminated surveys	 425
	 19.5	 Table 19.3 expressed as percentages	 425
	 19.6	 Sum of the cells in Tables 19.3 and 19.4	 425
	 19.7	 Cells of Table 19.6 as percentages	 426
	 19.8	 Weights derived from Tables 19.7 and 19.5	 426
	 19.9	 Results of an hypothetical household survey compared to secondary source data	 427
	 19.10	 Two-way distribution of completed surveys by percentage (originally shown in Table 19.5)	 427
	 19.11	 Results of factoring the rows of Table 19.10	 428
	 19.12	 Second iteration, in which columns are factored	 428
	 19.13	 Third iteration, in which rows are factored again	 429
	 19.14	 Weights derived from the iterative proportional fitting	 429
	 20.1	 Final disposition codes for RDD telephone surveys	 439
	 23.1	 Preservation metadata elements and description	 504
Acknowledgements
As is always the case, many people have assisted in the process that has led to this book.
First, I would like to acknowledge all those, too numerous to mention by name, who
have helped me over the years, to learn and understand some of the basics of design-
ing and implementing surveys. They have been many and they have taught me much
of what I now know in this field. However, having said that, I would particularly like
to acknowledge those whom I have worked with over the past fifteen years or more on
the International Steering Committee for Travel Survey Conferences (ISCTSC), who
have contributed enormously to broadening and deepening my own understandings of
surveys. In particular, I would like to mention, in no particular order, Arnim Meyburg,
Martin Lee-Gosselin, Johanna Zmud, Gerd Sammer, Chester Wilmot, Werner Brög,
Juan de Dios Órtuzar, Manfred Wermuth, Kay Axhausen, Patrick Bonnel, Elaine
Murakami, Tony Richardson, (the late) Pat van der Reis, Peter Jones, Alan Pisarski,
Mary Lynn Tischer, Harry Timmermans, Marina Lombard, Cheryl Stecher, Jean-Loup
Madre, Jimmy Armoogum, and (the late) Ryuichi Kitamura. All these individuals have
inspired and helped me and contributed in various ways to this book, most of them,
probably, without realising that they have done so.
I would also like to acknowledge the support I have received in this endeavour from
the University of Sydney, and especially from the director of the Institute of Transport
and Logistics Studies, Professor David Hensher. Both David and the university have
provided a wide variety of support for the writing and production of this book, for
which I am most grateful.
However, most importantly, I would like to acknowledge the enormous support and
encouragement from my wife, Carmen, and her patience, as I have often spent long
hours working on this book, and her unquestioning faith in me that I could do it. She
has been an enduring source of strength and inspiration to me. Without her, I doubt that
this book would have been written.
As always, a book can see the light of day only through the encouragement and
support of a publisher and those assisting in the publishing process. I would like to
acknowledge Chris Harrison of Cambridge University Press, who first thought that
this book might be worth publishing and encouraged me to develop the outline for
it, and then provided critical input that has helped to shape the book into what it has
become. I would also like to thank profusely Mike Richardson, who carefully and thor-
oughly copy-edited the manuscript, improving immensely its clarity and complete-
ness. I would also like to thank Joanna Breeze, the production editor at Cambridge.
She has worked with me with all the delays I have caused in the book production, and
has still got this book to publication in a very timely manner. However, as always, and
in spite of the help of these people, any errors that remain in the book are entirely my
responsibility.
Finally, I would like to acknowledge the contributions made by the many students I
have taught over the years in this area of survey design. The interactions we have had,
the feedback I have received, and the enjoyment I have had in being able to teach this
material and see students understand and appreciate what good survey design entails
have been most rewarding and have also contributed to the development of this book. I
hope that they and future students will find this book to be of help to them and a contin-
uing reference to some of those points that we have discussed.
Peter Stopher
Blackheath, New South Wales
August 2011
1	 Introduction
1.1	 The purpose of this book
There are a number of books available that treat various aspects of survey design, sam-
pling, survey implementation, and so forth (examples include Cochran, 1963; Dillman,
1978, 2000; Groves and Couper, 1998; Kish, 1965; Richardson, Ampt, and Meyburg,
1995; and Yates, 1965). However, there does not appear to be a single book that covers
all aspects of a survey, from the inception of the survey itself through to archiving the
data. This is the purpose of this book. The reader will find herein a complete treatment
of all aspects of a survey, including all the elements of design, the requirements for
testing and refinement, fielding the survey, coding and analysing the resulting data,
documenting what happened, and archiving the data, so that nothing is lost from what
is inevitably an expensive process.
This book concentrates on surveys of human populations, which are generally more
challenging, and more difficult both to design and to implement, than most surveys
of non-human populations. In addition, because of the background of the author,
examples are drawn mainly from surveys in the area of transport planning. However,
the examples are purely illustrative; no background is needed in transport planning to
understand the examples, and the principles explained are applicable to any survey that
involves human response to a survey instrument. In spite of this focus on human partic-
ipation in the survey process, there are occasional references to other types of surveys,
especially observational and counting types of surveys.
In writing this book, the author has tried to make this as complete a treatment as pos-
sible. Although extensive references are included to numerous publications and books
on various aspects of measuring data, the reader should be able to find all that he or she
requires within the covers of this book. This includes a chapter on some basic aspects
of statistics and probability that are used subsequently, particularly in the development
of the statistical aspects of surveys.
In summary, then, the purpose of this book is to provide the reader with an exten-
sive and, as far as possible, exhaustive treatment of issues involved in the design and
execution of surveys of human populations. It is the intent that, whether the reader is
a student, a professional who has been asked to design and implement a survey, or
someone attempting to gain a level of knowledge about the survey process, all ques-
tions will be answered within these pages. This is undoubtedly a daunting task. The
reader will be able to judge the extent to which this has been achieved. The book is also
designed so that someone who has no prior knowledge of statistics, probability, surveys,
or the purposes to which surveys may be put can pick up and read this book, gaining
knowledge and expertise in doing so. At the same time, this book is designed as a ref-
erence book. To that end, an extensive index is provided, so that the user of this book
who desires information on a particular topic can readily find that topic, either from the
table of contents, or through the index.
1.2 Scope of the book
As noted in the previous section, the book starts with a treatment of some basic statis-
tics and probability. The reader who is familiar with this material may find it appro-
priate to skip this chapter. However, for those who have already learnt material of this
type but not used it for a while, as well as those who are unfamiliar with the material,
it is recommended that this chapter be used as a means for review, refreshment, or
even first-time learning. It is then followed by a chapter that outlines some basic issues
of surveys, including a glossary of terms and definitions that will be found helpful
in reading the remainder of the book. A number of fundamental issues, pertinent to
overall survey design, are raised in this chapter. Chapter 4 introduces the topic of the
ethics of surveys, outlines a number of ethical issues, and proposes a number of
basic ethical standards to which surveys of human populations should adhere. The
fifth chapter of the book discusses the primary issues of designing a survey. A major
underlying theme of this chapter is that there is no such thing as an ‘all-purpose sur-
vey’. Experience has repeatedly demonstrated that only surveys designed with a clear
purpose in mind can be successful.
The next nine chapters deal with all the various design issues in a survey, given that
we have established the overall purpose or purposes of the survey. The first of these
chapters (Chapter 6) discusses and describes all the current methods that are available
for conducting surveys of human populations, in which people are asked to partic-
ipate in the survey process. Mention is also made of some methods of dealing with
other types of survey that are appropriate when the objects of the survey are observed
in some way and do not participate in the process. In Chapter 7, the topic of focus
groups is introduced, and potential uses of focus groups in designing quantitative and
qualitative surveys are discussed. The chapter does not provide an exhaustive treat-
ment of this topic, but does provide a significant amount of detail on how to organise
and design focus groups. In Chapter 8, the design of survey instruments is discussed
at some length. Illustrations of some principles of design are included, drawn princi-
pally from transport and related surveys. Chapters 9 and 10 deal with issues relating
to question design and question wording and special issues relating to qualitative and
preference surveys. Chapter 11 deals with the design of data collection procedures
themselves, including such issues as item and unit nonresponse, what constitutes a
complete response, the use of proxy reporting and its effects, and so forth. The seventh
of this group of chapters (Chapter 12) deals with pilot surveys and pretests, a topic
that is too often neglected in the design of surveys. A number of issues in designing
and undertaking such surveys and tests are discussed. Chapter 13 deals with the topic
of sample design and sampling issues. In this chapter, there is extensive treatment of
the statistics of sampling, including estimation of sampling errors and determination of
sample sizes. The chapter describes most of the available methods of sampling, includ-
ing simple random samples, stratified samples, multistage samples, cluster samples,
systematic samples, choice-based samples, and a number of sampling methods that are
often considered but that should be avoided in most instances, such as quota samples,
judgemental samples, and haphazard samples.
Chapter 14 addresses the topic of repetitive surveys. Many surveys are intended to
be done as a ‘one-off’ activity. For such surveys, the material covered in the preceding
chapters is adequate. However, there are many surveys that are intended to be repeated
from time to time. This chapter deals with such issues as repeated cross-sectional sur-
veys, panel surveys, overlapping samples, and continuous surveys. In particular, this
chapter provides the reader with a means to compare the advantages and disadvantages
of the different methods, and it also assists in determining which is appropriate to
apply in a given situation.
Chapter 15 builds on the material in the preceding chapters and deals with the issue
of survey economics. This is one of the most troublesome areas, because, as many
companies have found out, it is all too easy to be bankrupted by a survey that is under-
taken without a real understanding and accounting of the costs of a survey. While
information on actual costs will date very rapidly, this chapter attempts to provide rel-
ative data on costs, which should help the reader estimate the costs of different survey
strategies. This chapter also deals with many of the potential trade-offs in the design
of surveys.
Chapter 16 delves into some of the issues relating to the actual survey implemen-
tation process. This includes issues relating to training survey interviewers and moni-
toring the performance of interviewers, and the chapter discusses some of the danger
signs to look for during implementation. This chapter also deals with issues regarding
the ethics of survey implementation, especially the relationships between the survey
firm, the client for the survey, and the members of the public who are the respondents
to the survey. Chapter 17 introduces a topic that is becoming of increasing interest:
Web-based surveys. Although this is a field that is as yet quite young, there are an
increasing number of aspects that have been researched and from which the reader can
benefit. Chapter 18 deals with the process of coding and data entry. A major issue in
this topic is the geographic coding of places that may be requested in a survey.
Chapter 19 addresses the topics of data expansion and weighting. Data expansion is
outlined as a function of the sampling method, and statistical procedures for expanding
each of the different types of sample are provided in this chapter. Weighting relates to
problems of survey bias, resulting either from incomplete coverage of the population in
the sampling process or from nonresponse by some members of the subject population.
This is an increasingly problematic area for surveys of human populations, resulting
from a myriad of issues relating to voluntary participation. Chapter 20 addresses the
issue of nonresponse more completely. Here, issues of who is likely to respond and
who is not are discussed. Methods to increase response rates are described, and refer-
ence is made again to the economics of the survey design. The question of computing
response rates is also addressed in this chapter. This is usually the most widely recog-
nised statistic for assessing the quality of a survey, but it is also a statistic that is open
to numerous methods of computation, and there is considerable doubt as to just what
it really means.
Chapter 21 deals with a range of other measures of data quality, some that are gen-
eral and some, by way of example, that are specific to surveys in transport. These mea-
sures are provided as a way to illustrate how survey-specific measures of quality can
be devised, depending on the purposes of the survey. Chapter 22 discusses some issues
of the future of human population surveys, especially in the light of emerging technol-
ogies and their potential application and misapplication to the survey task.
Chapter 23, the final chapter in the book, covers the issues of documenting and
archiving the data. This all too often neglected area of measuring data is discussed at
some length. A list of headings for the final report on the survey is provided, along with
suggestions as to what should be included under the headings. The issue of archiving
data is also addressed at some length. Data are expensive to collect and are rarely
archived appropriately. The result is that many expensive surveys are effectively lost
soon after the initial analyses are undertaken. In addition, knowledge about the survey
is often lost when those who were most centrally involved in the survey move on to
other assignments, or leave to work elsewhere.
1.3 Survey statistics
Statistics in general, and survey statistics in particular, constitute a relatively young
area of theory and practice. The earliest instance of the use of statistics probably dates from
the middle of the sixteenth century, and relates to the start of data collection in France
regarding births, marriages, and deaths, and in England to the collection of data on
deaths in London each week (Berntson et al., 2005). It was then not until the middle
of the eighteenth century that publications began to appear advancing some of the ear-
liest theories in statistics and probability. However, much of the modern development
of statistics did not take place until the late nineteenth and early twentieth centuries
(Berntson et al., 2005):
Beginning around 1880, three famous mathematicians, Karl Pearson, Francis Galton and Edgeworth,
created a statistical revolution in Europe. Of the three mathematicians, it was Karl Pearson, along
with his ambition and determination, that led people to consider him the founder of the twentieth-
century science of statistics.
It was only in the early twentieth century that most of the now famous names in sta-
tistics made their contributions to the field. These included such statisticians as Karl
Pearson, Francis Galton, C. R. Rao, R. A. Fisher, E. S. Pearson, and Jerzy Neyman,
among many others, who all made major contributions to what we know today as the
science of statistics and probability.
Survey sampling statistics is of even more recent vintage. Among the most notable
names in this field of study are those of R. A. Fisher, Frank Yates, Leslie Kish, and
W. G. Cochran. Fisher may have given survey sampling its birth, both through his own
contributions and through his appointment of Frank Yates as assistant statistician at
Rothamsted Experimental Station in 1931. In this post, Yates developed, often in col-
laboration with Fisher, what may be regarded as the beginnings of survey sampling in
the form of experimental designs (O’Connor and Robertson, 1997). His book Sampling
Methods for Censuses and Surveys was first published in 1949, and it appears to be the
first book on statistical sampling designs.
Leslie Kish, who founded the Survey Research Institute at the University of
Michigan, is also regarded as one of the founding fathers of modern survey sampling
methods, and he published his seminal work, called Survey Sampling, in 1965. Close
in time to Kish, W. G. Cochran published his seminal work, Sampling Techniques, in
1963.
Based on these efforts, the science of survey sampling cannot be considered to be
much over fifty years old, which makes it a very new scientific endeavour. As a result of this
relative recency, there is still much to be done in developing the topic of survey sampling,
while technologies for undertaking surveys have undergone and continue to undergo
rapid evolution. The fact that most of the fundamental books on the topic are about
forty years old suggests that it is time to undertake an updated treatise on the topic.
Hence, this book has been undertaken.
2	 Basic statistics and probability
2.1 Some definitions in statistics
Statistics is defined by the Oxford Dictionary of English Etymology as ‘the political
science concerned with the facts of a state or community’, and the word is derived
from the German statistisch. The beginning of modern statistics was in the sixteenth
century, when large amounts of data began to be collected on the populations of coun-
tries in Europe, and the task was to make sense of these vast amounts of data. As statis-
tics has evolved from this beginning, it has become a science concerned with handling
large quantities of data, but also with using much smaller amounts of data in an effort
to represent entire populations, when the task of handling data on the entire population
is too large or expensive. The science of statistics is concerned with providing inputs
to political decision making, to the testing of hypotheses (understanding what would
happen if …), drawing inferences from limited data, and, considering the data limita-
tions, doing all these things under conditions of uncertainty.
A word used commonly in statistics and surveys is population. The population is
defined as the entire collection of elements of concern in a given situation. It is also
sometimes referred to as a universe. Thus, if the elements of concern are pre-school
children in a state, then the population is all the pre-school children in the state at the
time of the study. If the elements of concern are elephants in Africa, then the popula-
tion consists of all the elephants currently in Africa. If the elements of concern are the
vehicles using a particular freeway on a specified day, then the population is all the
vehicles that use that particular freeway on that specific day.
It is very clear that statistics is the study of data. Therefore, it is necessary to
understand what is meant by data. The word data is a plural noun from the Latin
datum, meaning given facts. As used in English, the word means given facts from
which other facts may be inferred. Data are fundamental to the analysis and model-
ling of real-world phenomena, such as human populations, the behaviour of firms,
weather systems, astronomical processes, sociological processes, genetics, etc.
Therefore, one may state that statistics is the process for handling and analysing
data, such that useful conclusions can be drawn, decisions made, and new knowledge
accumulated.
Another word used in connection with statistics is observation. An observation may
be defined as the information that can be seen about a member of a subject population.
An observation comprises data about relevant characteristics of the member of the
population. This population may be people, households, galaxies, private firms, etc.
Another way of thinking of this is that an observation represents an appropriate group-
ing of data, in which each observation consists of a set of data items describing one
member of the population.
A parameter is a quantity that describes some property of the population. Parameters
may be given as numbers, proportions, or percentages. For example, the number of
male pre-school children in the state might be 16,897, and this number is a parameter.
The proportion of baby elephants in Africa might be 0.39, indicating that 39 per cent
of all elephants in Africa at this time are babies. This is also a parameter. Sometimes,
one can define a particular parameter as being critical to a decision. This would then
be called a decision parameter. For example, suppose that a decision is to be made
as to whether or not to close a primary school. The decision parameter might be the
number of schoolchildren that would be expected to attend that school in, say, the next
five years.
A sample is some subset of a population. It may be a large proportion of the popu-
lation, or a very small proportion of the population. For example, a survey of Sydney
households, which comprise a population of about 1,300,000, might consist of 130,000
households (a 10 per cent sample) or 300 households (a 0.023 per cent sample).
A statistic is a numerical quantity that describes a sample. It is therefore the equiva-
lent of a parameter, but for a sample rather than the population. For example, a survey
of 130,000 households in Sydney might have shown that 52 per cent of households
own their own home or are buying it. This would be a statistic. If, on the other hand, a
figure of 54 per cent was determined from a census of the 1,300,000 households, then
this figure would be a parameter.
Statistical inference is the process of making statements about a population based
on limited evidence from a sample study. Thus, if a sample of 130,000 households
in Sydney was drawn, and it was determined that 52 per cent of these owned or were
purchasing their homes, then statistical inference would lead one to propose that this
might mean that 676,000 (52 per cent of 1,300,000) households in Sydney own or are
purchasing their homes.
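The arithmetic behind the Sydney example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the simple proportional expansion assumes that the sample represents the population without bias, an assumption that later chapters examine in detail.

```python
# Illustrative sketch only; the function names are hypothetical.

def sampling_fraction(sample_size, population_size):
    """Return the sample size as a percentage of the population."""
    return 100.0 * sample_size / population_size


def expand_proportion(sample_proportion, population_size):
    """Expand a sample proportion (a statistic) to an estimated population count.

    Assumes that the sample represents the population without bias.
    """
    return round(sample_proportion * population_size)


POPULATION = 1_300_000  # approximate number of Sydney households, as in the text

print(sampling_fraction(130_000, POPULATION))        # 10.0
print(round(sampling_fraction(300, POPULATION), 3))  # 0.023
print(expand_proportion(0.52, POPULATION))           # 676000
```

The 52 per cent figure is a statistic, because it describes the sample; the expanded figure of 676,000 households is an inference about the population, not an observed parameter.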
2.1.1 Censuses and surveys
Of particular relevance to this book is the fact that there are two methods for collect-
ing data about a population of interest. The first of these is a census, which involves
making observations of every member of the population. Censuses of the human pop-
ulation have been undertaken in most countries of the world for many years. There are
references in the Bible to censuses taken among the early Hebrews, and later by the
Romans at the time of the birth of Christ. In Europe, most censuses began in the eigh-
teenth century, although a few began earlier than that. In the United States of America,
censuses began in 1790. Many countries undertake a census once
in each decade, either in the year ending in zero or in one. Some countries, such as
Australia, undertake a census twice in each decade. A census may be as simple as a
head count (enumerating the total size of the population) or it may be more complex,
by collecting data on a number of characteristics of each member of the population,
such as name, address, age, country of birth, etc.
A survey is similar to a census, except that it is conducted on a subset of the population,
not the entire population. A survey may involve a large percentage of the population
or may be restricted to a very small sample of the population. Much of the science of
survey statistics has to do with how one makes a small sample represent the entire popu-
lation. This is discussed in much more detail in the next chapter. A survey, by definition,
always involves a sample of the population. Therefore, to speak of a 100 per cent sample
is contradictory; if it is a sample, it must be less than 100 per cent of the population.
2.2 Describing data
One of the first challenges for statistics is to describe data. Obviously, one can provide
a complete set of data to a decision maker. However, the human mind is not capable
of utilising such information efficiently and effectively. For example, a census of the
United States would produce observations on over 300 million people, while one of
India would produce observations on over 1 billion people. A listing of those observations
represents something that most human beings would be incapable of utilising.
What is required, then, is to find some ways to simplify and describe data, so that use-
ful information is preserved but the sheer magnitude of the underlying data is hidden,
thereby not distracting the human analyst or decision maker.
Before examining ways in which data might be presented or described, such that the
mind can grasp the essential information contained therein, it is important to under-
stand the nature of different types of data that can be collected. To do this, it seems
useful to consider the measurement of a human population, especially since that is the
main topic of the balance of this book.
In mathematical statistics, we refer to things called variables. A variable is a char-
acteristic of the population that may take on differing or varying values for different
members of the population. Thus, variables that could be used to describe members
of a human population may include such characteristics as name, address, age or date
of birth, place of birth, height, weight, eye colour, hair colour, and shoe size. Each of
these characteristics provides differing levels of information that can be used in vari-
ous ways. We can divide these characteristics into four different types of scales, a scale
representing a way of measuring the characteristic.
2.2.1 Types of scales
Nominal scales
Each person in the population has a name. The person’s name represents a label by
which that person can be identified, but provides little other information. Names can
be ordered alphabetically or can be ordered in any of a number of arbitrary ways, such
as the order in which data are collected on individuals. However, no information is
provided by changing the order of the names. Therefore, the only thing that the name
provides is a label for each member of the population. This is called a nominal scale.
A nominal scale is the least informative of the different types of scales that can be used
to measure characteristics, but its lack of other information does not render it of less
value. Other examples of nominal data are the colours of hair or eyes of the members
of the population, bus route numbers, the numbers assigned to census collection dis-
tricts, names of firms listed on a country’s stock exchange, and the names of magazines
stocked by a newsagency.
Ordinal scales
Each person in the population has an address. The address will usually include a house
number and a street name, along with the name of the town or suburb in which the
house is located. The address clearly also represents a label, just as does the person’s
name. However, in the case of the address, there is more information provided. If the
addresses are sorted by number and by street, in most places in the world this will pro-
vide additional information. These sorted addresses will actually help an investigator
to locate each home, in that it is expected that the houses are arranged in numerical
order along the street, and probably with odd numbers on one side of the street and
even numbers on the other side. As a result, there is order information provided in the
address. It is, therefore, known as an ordinal scale. However, if it is known that one
person lives at 27 Main Street, and another person lives at 35 Main Street, this does not
indicate how far apart these two people live. In some countries, they could be next door
to each other, while in others there might be three houses between them or even seven
houses between them (if numbering goes down one side of the street and back on the
other). The only thing that would be known is that, starting at the first house on Main
Street, one would arrive at 27 before one would arrive at 35. Therefore, order is the
only additional information provided by this scale. Other examples of ordinal scales
would be the list of months in the year, censor ratings of movies, and a list of runners
in the order in which they finished a race.
Interval scales
Each person in the population has a shoe size. For the purposes of this illustration,
the fact that there are slight inconsistencies in shoe sizes between manufacturers will
be ignored, and it will be assumed, instead, that a man’s shoe size nine is the same
for all men’s shoes, for example. Shoe size is certainly a label, in that a shoe can be
called a size nine or a size twelve, and so forth. This may be a useful way of labelling
shoes for a lot of different reasons. In addition, there is clearly order information, in
that a size nine is smaller than a size twelve, and a size seven is larger than a size five.
Furthermore, within each of children’s, men’s, and women’s shoes, each increase in
a size represents a constant increase in the length of the shoe. Thus, the difference
between a size nine and a size ten shoe for a man is the same as the difference between
a size eight and a size nine, and so on for any two adjacent numbers. In other words,
there is a constant interval between each shoe size. On the other hand, there is no nat-
ural zero in this scale (in fact, a size of zero generally does not exist), and it is not true
that a size five is half the length of a size ten. Therefore, shoe size may be considered
to be an interval scale. Women’s dress sizes in a number of countries also represent
an interval scale, in which each increment in dress size represents a constant interval
of increase in size of the dress, but a size sixteen dress is not twice as large as a size
eight. In many cases, the sizing of an item of clothing as small, medium, large, etc. also
represents an interval scale. Another example of an interval scale is the normal scale of
temperature in either degrees Celsius or degrees Fahrenheit. An interval of one degree
represents the same increase or decrease in temperature, whether it is between 40 and
41 or 90 and 91. However, we are not able to state that 60 degrees is twice as hot as
30 degrees. There is also not a natural zero on either the Celsius scale or the Fahrenheit
scale. Indeed, the Celsius scale sets the temperature at which water freezes as 0, but
the Fahrenheit scale sets this at 32, and there is not a particular physical property of the
zero on the Fahrenheit scale.
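The temperature example can be checked numerically. The following minimal sketch uses the standard Celsius-to-Fahrenheit conversion; the helper name is hypothetical. It shows that equal intervals remain equal when the units change, whereas an apparent ratio of two temperatures does not survive the change of units, which is why such a ratio has no physical meaning on an interval scale.

```python
import math

# Illustrative sketch; the helper name is hypothetical.

def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0


# One-degree Celsius intervals at different points on the scale.
step_low = celsius_to_fahrenheit(41) - celsius_to_fahrenheit(40)
step_high = celsius_to_fahrenheit(91) - celsius_to_fahrenheit(90)

# The two intervals are the same size in Fahrenheit as well
# (compared with isclose to allow for floating-point rounding).
print(math.isclose(step_low, step_high))

# The ratio 60:30 in Celsius is 2, but the same two temperatures
# expressed in Fahrenheit (140 and 86) have a different ratio.
print(60 / 30)
print(celsius_to_fahrenheit(60) / celsius_to_fahrenheit(30))
```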
Ratio scales
Each member of the population has a height and a weight. Again, each of these two
measures could be used as a label. We might say that a person is 180 centimetres tall,
or weighs 85 kilograms. These measures also contain ordinal information. We know
that a person who weighs 85 kilograms is heavier than a person who weighs 67 kilo-
grams. Furthermore, we know that these measures contain interval information. The
difference between 179 centimetres and 180 centimetres is the same as the difference
between 164 centimetres and 165 centimetres. However, there is even more information
in these measures. There is ratio information. In other words, we know that a person
who is 180 centimetres tall is twice as tall as a person who is 90 centimetres tall, and
that a person weighing 45 kilograms is only half the weight of a person weighing 90
kilograms. There are two important new pieces of information provided by these mea-
sures. First, there is a natural zero in the measurement scale. Both weight and height
have a zero point, which represents the absence of weight or the absence of height.
Second, there is a multiplicative relationship among the measures on the scale, not just
an additive one. Therefore, both weight and height are described as ratio scales. Other
examples of ratio scales are distance or length measures, measures of speed, measures
of elapsed time, and so forth. However, it should be noted that measurement of clock
time is interval-scaled (there is no natural zero, and 5 a.m. is not a half of 10 a.m.),
while elapsed time is ratio-scaled, because zero minutes represents the absence of any
elapsed time, and twenty minutes is twice as long as ten minutes, for example.
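The distinction between clock time and elapsed time happens to be mirrored in Python's standard datetime module, which makes for a convenient, if informal, illustration. In the sketch below the calendar date is arbitrary and the variable names are hypothetical; the point is only that durations can be divided by one another, while clock times can be subtracted but not divided.

```python
from datetime import datetime, timedelta

# Elapsed time is ratio-scaled: one duration can be divided by another.
ten_minutes = timedelta(minutes=10)
twenty_minutes = timedelta(minutes=20)
ratio = twenty_minutes / ten_minutes  # a plain number

# Clock time is interval-scaled: the difference between two clock times
# is meaningful (and is itself an elapsed time). The date is arbitrary.
five_am = datetime(2011, 8, 1, 5, 0)
ten_am = datetime(2011, 8, 1, 10, 0)
gap = ten_am - five_am

# A ratio of two clock times is not defined; the library rejects it.
try:
    ten_am / five_am
    clock_ratio_defined = True
except TypeError:
    clock_ratio_defined = False

print(ratio, gap, clock_ratio_defined)
```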
Measurement scales
The preceding sections have outlined four scales of measurement: nominal, ordinal,
interval, and ratio. They have also demonstrated that these four scales are themselves
an ordinal scale, in which the order, as presented in the preceding sentence, indicates
increasing information content. Furthermore, each of the scale types, as ordered above,
contains the information of the previous type of scale, and then adds new information
content. Thus an ordinal scale also has nominal information, but adds to that informa-
tion on order; an interval scale has both nominal and ordinal information, but adds to
that a consistent interval of measurement; and a ratio scale contains all nominal, ordi-
nal, and interval information, but adds ratio relationships to them.
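The cumulative nature of the four scales can be written down as a small data structure. The sketch below is one possible representation, not a standard one; the class, dictionary, and function names are all hypothetical.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Measurement scales, ordered by increasing information content."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# The information that each scale adds to the scales below it.
ADDED_INFORMATION = {
    Scale.NOMINAL: "label",
    Scale.ORDINAL: "order",
    Scale.INTERVAL: "constant interval",
    Scale.RATIO: "ratio (natural zero)",
}


def information_content(scale):
    """List everything a scale conveys: its own addition plus all lower ones."""
    return [ADDED_INFORMATION[s] for s in Scale if s <= scale]


print(information_content(Scale.INTERVAL))
# ['label', 'order', 'constant interval']
```

Because the scale types are themselves ordered, a comparison such as `Scale.ORDINAL < Scale.RATIO` is meaningful, which is exactly the sense in which the text describes the four scales as forming an ordinal scale.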
There are two other ways in which scales can be described, because most scales can
be measured in different ways. The first of these relates to whether the scale is contin-
uous or discrete. A continuous scale is one in which the measurement can be made to
any degree of precision desired. For example, we can measure elapsed time to the near-
est hour, or minute, or second, or nanosecond, etc. Indeed, the only thing that limits
the precision by which we can measure this scale is the precision of our instruments
for measurement. However, there is no natural limit to precision in such cases. This is
a continuous scale. A discrete scale, on the other hand, cannot be subdivided beyond
a certain point. For example, shoe sizes are a discrete scale. Many shoe manufactur-
ers will provide shoes in half-size increments, while others will provide them only in
whole-size increments. Subdivision below half sizes simply is not done. Similarly, any
measurement that involves counting objects, such as counting the number of members
of a population, is a discrete scale. We cannot have fractional people, fractional houses,
or fractional cars, for example.
The second descriptor of a scale is whether it is inherently exact or approximate. By
their nature, all continuous scales are approximate. This is so because we can always
increase the precision of measurement. Generally, numbers obtained from counting
are exact, unless the counting mechanism is capable of error. However, other discrete
scales may be approximate or exact. In most clothing or shoe sizes, the measure would
be considered approximate, because sizes often differ between manufacturers, and
between countries. A size nine shoe is not the same size in the United States and in the
United Kingdom, for example, nor is it necessarily the same size from two different
shoe makers in the same country.
It is important to recognise what type of a scale we are dealing with, when infor-
mation is measured on scales, because the type of scale will also often either dictate
how the information can be presented or restrict the analyst to certain ways of pre-
sentation. Similarly, whether the measure is discrete or continuous will also affect the
presentation of the data, as will, in some cases, whether the data are approximate or
exact.
2.2.2 Data presentation: graphics
It is appropriate to start with some simple rules about graphical presentations. There
are four principal types of graphical presentation: scatter plots, pie charts, histograms
or bar charts, and line graphs.
A scatter plot is a plot of the frequency with which specific values of a pair of vari-
ables occur in the data. Thus, the X-axis of the plot will contain the values of one of the
variables that are found in the data, and the Y-axis will contain the values of the other
variable. As such, any type of measure can be presented on a scatter plot. However, if
all values occur only once – i.e., are unique to an observation – then a scatter plot is
of no particular interest. Therefore, although any data can theoretically be plotted on a
scatter plot, data that consist of unique values, or continuous data that will probably
have frequencies of only one or two at most for any pair of values, will not
be illuminated by a scatter plot.
An example of a scatter plot is provided in Figure 2.1, which shows a scatter plot of
odometer readings of cars versus the model year of the vehicle. The Y-axis is a ratio-
scaled variable, and the X-axis is an interval-scaled variable. The scatter plot indicates
that there probably is a relationship between odometer readings and model year, such
that the higher the model year value, the lower the odometer reading, as would be
expected. This is a useful scatter plot.
Figure 2.2 illustrates a scatter plot of two nominal-scaled variables: fuel type versus
body type. It is not a very useful illustration of the data. First, we cannot tell how many
points fall at each combination of values. Second, all it really tells us is that there are
no taxis (body type 5) in this data set, that all vehicle types use petrol (fuel type 1), that
all except motorcycles (body type 6) use diesel (fuel type 2), and that only cars (body
type 1), four-wheel drive (4WD) vehicles (body type 2), and utility/van/panel vans
(body type 3) use dual fuel (fuel type 4). This illustrates that nominal data – both fuel
type and body type are nominal scales – may not produce a useful scatter plot.
Figure 2.1 Scatter plot of odometer reading versus model year (X-axis: model year, 1950 to 2010; Y-axis: odometer reading, 0 to 1,200,000)
Figure 2.2 Scatter plot of fuel type by body type (X-axis: body type; Y-axis: fuel type)
A pie chart is a circle that is divided into segments representing specific values in
the data, with the length of the segment along the circumference of the circle indicating
how frequently the value occurs in the data. Again, pie charts can be used with any type
of data, when the information to be presented is the frequency of occurrence. However,
they will generally not work with continuous data, unless the data are first grouped and
converted to discrete categories. An example of a pie chart is provided in Figure 2.3.
This shows that the pie chart works well for nominal data, in this case the vehicle body
type from a survey of households.
Figure 2.4 shows a pie chart for category data – i.e., discrete data. The data are
reported household incomes from a survey of households. The categories were those
used in the survey. Income, being measured in dollars and with a natural zero, is actu-
ally a ratio scale. In the categories collected, income is a ratio-scaled discrete measure.
Again, the pie chart provides a good representation of the data.
Figure 2.3 Pie chart of vehicle body types (segments: 4WD, car, motorcycle, other, taxi, truck, utility vehicle)
Figure 2.4 Pie chart of household income groups (segments: None, the ten income ranges from $1–$4,159 to $104,000+, Don't know, and Refused)
A histogram or bar chart is used for presenting discrete data. Such data will be
interval- or ratio-scaled data. Histograms can be constructed in several different ways.
When presenting complex information, bars can be stacked, showing how different
classes of items add up to a total within each bar. Bars can also be plotted so that each
bar touches the next, or they may be plotted with gaps between. There is no particular
rule for plotting bars in this manner, and it is more a matter of personal preference.
Examples of two types of histograms are shown in Figures 2.5 and 2.6. Histograms can
also be used to indicate the frequency of occurrence of specific values of both nominal
and ordinal data. In this case, it is preferred that the bars do not touch, the spaces indi-
cating that the scale is not interval or ratio.
Figure 2.5 shows ratio-scaled discrete data on household incomes, this time in a
two-dimensional histogram or bar chart. Note that the bars touch, indicating the under-
lying continuous nature of the data. Figure 2.6 shows a histogram of nominal data
frequencies of vehicle type for household vehicles. Two instructive observations may
be made of this histogram. First, the dominance of the car tends to make the histogram
somewhat less useful. In contrast, the pie chart really communicated the information
better. Second, the bars do not touch, in this case clearly indicating the discrete
categories of a nominal scale.
Figure 2.5 Histogram of household income (X-axis: annual income class; Y-axis: number of respondents)
Figure 2.6 Histogram of vehicle types (X-axis: vehicle type; Y-axis: number)
The fourth type of chart is a line graph. This is much more restricted in application
than the other types of charts. A line graph should be used only with continuous data,
whether interval- or ratio-scaled. It is inappropriate to use line graphs to present data
that are discrete, or data that are nominal or ordinal in nature. An example of a line
graph is shown in Figure 2.7.
Temperature is inherently a continuous measurement. It is therefore appropriate to
use a line graph to present these data. This case demonstrates the use of two lines on
the same graph. This allows one not only to see the maximum and minimum tempera-
tures, but also to deduce that there may be a relationship between the two.
A special type of line graph is an ogive. An ogive is a cumulative frequency line.
Even when the original data are discrete in nature, the ogive can be plotted as a line,
although a cumulative histogram is preferable. Generally, it makes sense to create
cumulative graphs only of interval- or ratio-scaled data, although the data may be
either discrete or continuous. Figure 2.8 shows an ogive for the income data used in
Figure 2.5.
The ogive is essentially an S-shaped curve, in that it starts with a line that is along
the X-axis and ends with a line that is parallel to the X-axis, with the line climbing
more or less continuously from the X-axis at the left to the top of the graph at the right.
A special case of the ogive is a relative ogive, in which the proportions or percentage
of observations are used, not the absolute counts, as in Figure 2.8. A relative ogive for
the same data will have the same shape, but the scale of the Y-axis changes, as shown
in Figure 2.9.
Figure 2.7 Line graph of maximum and minimum temperatures for thirty days (X-axis: day of week; Y-axis: temperature in °C; lines: maximum temperature and minimum temperature)
A step chart, which is the discrete version of an ogive, could also be drawn for
the income data. It can use either the count, the proportion, or the percentage for the
Y-axis. A step chart is shown in Figure 2.10.
Figure 2.8 Ogive of cumulative household income data from Figure 2.5 (X-axis: household income; Y-axis: cumulative number)
Figure 2.9 Relative ogive of household income (X-axis: household income; Y-axis: cumulative proportion)
2.2.3 Data presentation: non-graphical
Graphical presentations of data are very useful. As can be seen in the preceding
section, the adage that ‘a picture is worth a thousand words’ is clearly interpretable as
‘a picture is worth a thousand numbers’. Indeed, one can grasp rather readily from
the graphs what is potentially a large amount of data, which the human mind would
have difficulty grasping as raw data. However, pictures are not the only ways in
which data can be presented for easier assimilation. There are also numeric ways to
describe data. Ideally, what one would like would be some summary variables that
would give one an idea about the magnitude of each variable in the data, the disper-
sion of values, the variability of the values, and the symmetry or lack of symmetry
in the data.
Measures of magnitude
These measures could include such concepts as frequencies of occurrence of particular
values in the data, proportions of the data that possess a particular value, cumulative
frequencies or proportions, and some form of average value. Each of these measures
is considered separately.
Frequencies and proportions
Frequencies are simply counts of the number of times that a particular value occurs in
the data, while proportions are frequencies divided by the total number of observations
in the data. Table 2.1 shows the frequencies of occurrence of the different vehicle types
used in the earlier illustrations of graphical presentations.
For nominal data, cumulative frequencies or proportions are not sensible, because
the scale does not contain any ordered information. Thus, to produce a cumulative
frequency distribution for the entries in Table 2.1 would not make sense. Moreover, it
should be noted that frequencies and proportions are generally sensible only for dis-
crete data. However, if, in continuous data, there are large numbers of observations
with the same value and the data set is large, then frequencies and proportions may
possibly be useful. This, for example, might be the case for national data derived from
a census.
Figure 2.10 Relative step chart of household income (X-axis: household income; Y-axis: cumulative proportion)
For the income data plotted in Figure 2.5, both frequencies or proportions, and cumu-
lative frequencies or proportions, make sense. These are shown in Table 2.2. From the
information in Table 2.2 it is possible to grasp several things about the data on household
income, such as the fact that the largest group is the one with $15,600–$25,999 annual
income, followed by $78,000–$103,999 and $36,400–$51,999. It can also be seen that
16 per cent of households would not report their income. When the non-reported income
is excluded, one can see that the proportions change substantially, and that just over half of the
reporting households have incomes below $52,000. In effect, this table has summarised over 1,000
pieces of data and made them comprehensible, by presenting just a handful of numbers.
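The cumulative columns of Table 2.2 can be reproduced with a few lines of code. The following is a minimal sketch in Python, assuming the frequencies are typed in from Table 2.2; the variable names are illustrative only and are not part of the survey.

```python
from itertools import accumulate

# Frequencies of the eleven reported income classes, copied from Table 2.2.
income_classes = [
    "None", "$1-$4,159", "$4,160-$8,319", "$8,320-$15,599",
    "$15,600-$25,999", "$26,000-$36,399", "$36,400-$51,999",
    "$52,000-$62,399", "$62,400-$77,999", "$78,000-$103,999", "$104,000+",
]
frequencies = [28, 2, 11, 67, 155, 97, 129, 72, 105, 133, 101]
missing = 1 + 169  # "Don't know" plus "Refused"

n_valid = sum(frequencies)   # households that reported an income
n_all = n_valid + missing    # all survey returns

# Running totals, then the two versions of the cumulative proportion.
cumulative = list(accumulate(frequencies))
cum_prop_all = [c / n_all for c in cumulative]      # including missing
cum_prop_valid = [c / n_valid for c in cumulative]  # excluding missing

for label, f, c, p_all, p_valid in zip(
    income_classes, frequencies, cumulative, cum_prop_all, cum_prop_valid
):
    print(f"{label:<18} {f:>4} {c:>4} {p_all:.4f} {p_valid:.4f}")
```

Dividing by the number of valid observations rather than by all returns is what produces the ‘excluding missing’ column.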
In the case of the income data shown in Table 2.2, the groups were defined in the sur-
vey itself. However, one may also take data that are collected as continuous measures
and group them, both to display as a histogram and to present them in a table, similar
to Table 2.2. In such a case, it is necessary to know into how many categories to group
the data.
Number of classes or categories Sturges’ rule (Sturges, 1926) provides
guidance on how to determine the maximum number of classes into which to divide
data, whether grouping already discrete data or continuous data. There are a number
of elements to the rule.
(1)	 Interval classes must be inclusive and non-overlapping.
(2)	 Intervals should usually be of equal width, although the first and last interval may
be open-ended for some types of data.
(3)	 The number of classes depends on the number of observations in the data, accord-
ing to equation (2.1):
$$k = 1 + 3.322 \times \log_{10} n \qquad (2.1)$$
	where k = the number of classes, n = the number of observations.
Table 2.1 Frequencies and proportions of vehicle types

| Vehicle type | Frequency | Proportion |
|---|---|---|
| Car | 1,191 | 0.817 |
| 4WD | 96 | 0.066 |
| Utility vehicle | 134 | 0.092 |
| Truck | 10 | 0.007 |
| Taxi | 0 | 0 |
| Motorcycle | 19 | 0.013 |
| Other | 7 | 0.005 |
| Total | 1,457 | 1.000 |

Suppose, for example, that the income data had been collected as actual annual
income, and not in income classes. One might then ask the question as to how many
classes would be the maximum that could be used for income. This would be obtained
by substituting 900 into the above equation, because one should not include the miss-
ing data. This would result in a value for k of 10.81, which would be truncated to 10.
Therefore, Sturges’ rule would indicate that the maximum number of intervals that
should be used for these data is ten. The data were actually collected in eleven classes.
Therefore, this would suggest that the design was marginally appropriate and there
should not be a need to group together any of the classes with the number of valid
observations obtained. However, the intervals used violate Sturges’ rule in one respect,
in that they are not of equal size. This is not uncommon with income grouping, where
it is often the case, as here, that the lower incomes are divided into smaller classes than
the higher incomes. This is generally done to keep the population of the classes more
nearly equal.
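The calculation for the income data can be checked directly from equation (2.1). This is a small sketch in Python; the function name `sturges_classes` is an illustrative choice, not a library routine.

```python
import math


def sturges_classes(n: int) -> float:
    """Equation (2.1): k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


# 900 valid income observations (missing responses excluded).
k_income = sturges_classes(900)

# The rule gives a maximum, so the value is truncated rather than rounded.
max_classes = math.floor(k_income)

print(f"k = {k_income:.2f}, maximum number of classes = {max_classes}")
```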
Table 2.2 Frequencies, proportions, and cumulative values for household income

| Income range | Frequency | Proportion | Cumulative frequency | Cumulative proportion (including missing) | Cumulative proportion (excluding missing) |
|---|---|---|---|---|---|
| None | 28 | 0.0262 | 28 | 0.0262 | 0.0311 |
| $1–$4,159 | 2 | 0.0019 | 30 | 0.0280 | 0.0333 |
| $4,160–$8,319 | 11 | 0.0103 | 41 | 0.0383 | 0.0456 |
| $8,320–$15,599 | 67 | 0.0626 | 108 | 0.1009 | 0.1200 |
| $15,600–$25,999 | 155 | 0.1449 | 263 | 0.2458 | 0.2922 |
| $26,000–$36,399 | 97 | 0.0907 | 360 | 0.3364 | 0.4000 |
| $36,400–$51,999 | 129 | 0.1206 | 489 | 0.4570 | 0.5433 |
| $52,000–$62,399 | 72 | 0.0673 | 561 | 0.5243 | 0.6233 |
| $62,400–$77,999 | 105 | 0.0981 | 666 | 0.6224 | 0.7400 |
| $78,000–$103,999 | 133 | 0.1243 | 799 | 0.7467 | 0.8878 |
| $104,000+ | 101 | 0.0944 | 900 | 0.8411 | 1.0000 |
| Don’t know | 1 | 0.0009 | 901 | 0.8421 | |
| Refused | 169 | 0.1579 | 1,070 | 1.0000 | |
| Total | 1,070 | 1 | | | |

Suppose that the temperature data used in Figure 2.7 were to be grouped into classes.
The raw data are shown in Table 2.3. There are thirty observations of daily maximum
and minimum temperatures in this data set. Applying Sturges’ rule, the value of
k is found to equal 5.91, suggesting that five intervals would be the most that could be
used. For the high temperatures, the range is twenty-two to thirty-three. If this range
is divided into groupings of two degrees, this would produce six intervals, while using
three degrees would produce four intervals. In this case, given that k was found to be
close to six, it would be best to use six intervals of two degrees per interval. For the low
temperatures, the range is from sixteen to twenty-three. Grouping these also into groups
of two degrees in size, which is preferable when one wants to look at both minimum
and maximum temperatures on the same graph, or in side-by-side graphs, would result
in four groups. Because this is less than the maximum of six, it is acceptable. In this
case, grouping is sensible only if what one wants to do is to create a histogram of the
frequency with which various maximum and minimum temperatures occur. Such a
frequency table is shown in Table 2.4.
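The grouping into two-degree classes can be sketched as follows. This Python fragment assumes the thirty daily values are typed in from Table 2.3 and uses one common set of classes, starting at 16°C, for both series; the helper `group_counts` is an illustrative name.

```python
# Daily maximum and minimum temperatures (°C), in the order given in Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]


def group_counts(values, start, width, n_classes):
    """Count how many values fall in each equal-width class."""
    counts = [0] * n_classes
    for value in values:
        counts[(value - start) // width] += 1
    return counts


# Nine two-degree classes: 16-17, 18-19, ..., 32-33.
highs = group_counts(max_temps, start=16, width=2, n_classes=9)
lows = group_counts(min_temps, start=16, width=2, n_classes=9)

for i, (h, l) in enumerate(zip(highs, lows)):
    lower = 16 + 2 * i
    print(f"{lower}-{lower + 1}: highs={h}, lows={l}")
```

Using the same class boundaries for both series is what allows the two distributions to be compared on one graph or in side-by-side graphs.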
There is a second variant of Sturges’ rule for binary data. This variant defines the
number of classes, as shown in equation (2.2):
$$k = 1 + \log_{2} n \qquad (2.2)$$
Table 2.3 Minimum and maximum temperatures for a month (°C)

| Day | Maximum temperature | Minimum temperature |
|---|---|---|
| Sunday | 23 | 18 |
| Monday | 26 | 19 |
| Tuesday | 25 | 19 |
| Wednesday | 27 | 17 |
| Thursday | 32 | 22 |
| Friday | 29 | 21 |
| Saturday | 26 | 20 |
| Sunday | 27 | 19 |
| Monday | 30 | 22 |
| Tuesday | 31 | 21 |
| Wednesday | 33 | 23 |
| Thursday | 24 | 20 |
| Friday | 25 | 18 |
| Saturday | 27 | 19 |
| Sunday | 28 | 20 |
| Monday | 32 | 22 |
| Tuesday | 24 | 18 |
| Wednesday | 26 | 16 |
| Thursday | 25 | 17 |
| Friday | 22 | 17 |
| Saturday | 28 | 19 |
| Sunday | 27 | 20 |
| Monday | 28 | 20 |
| Tuesday | 29 | 21 |
| Wednesday | 28 | 20 |
| Thursday | 26 | 19 |
| Friday | 27 | 20 |
| Saturday | 30 | 21 |
| Sunday | 29 | 20 |
| Monday | 31 | 23 |
When n is less than 1,000, the two equations result in approximately the same num-
ber of classes. For example, for 900 cases, this second formula gives k equal to 10.81,
which is the identical result. For the thirty-observation case, the second formula gives
5.91, which is almost identical. It has been pointed out in various places (see Hyndman,
1995) that Sturges’ rule is good only for samples less than 200, and that it is based on a
flawed argument. Nevertheless, it is still the standard used by most statistical software
packages. There are two other rules that may be used, and these are discussed later in
this chapter, because they utilise statistical measures that have not been discussed at
this point. All the rules produce similar results for small samples, but diverge as the
sample size becomes increasingly large. The other possible problems with Sturges’
rule are, first, that it may lead to over-smoothed data and, second, that its requirement
for equal intervals may hide important information.
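The closeness of equations (2.1) and (2.2) can be examined numerically. The sketch below, in Python with illustrative function names, evaluates both forms; it rests on the identity that the base-2 logarithm of n equals the base-10 logarithm of n multiplied by about 3.3219, so any difference between the two forms comes only from rounding that constant to 3.322.

```python
import math


def sturges_base10(n: int) -> float:
    """Equation (2.1): k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def sturges_base2(n: int) -> float:
    """Equation (2.2): k = 1 + log2(n)."""
    return 1 + math.log2(n)


for n in (30, 900, 100_000):
    k10 = sturges_base10(n)
    k2 = sturges_base2(n)
    print(f"n={n:>7}: eq. (2.1) k={k10:.4f}, eq. (2.2) k={k2:.4f}")
```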
Stem and leaf displays Another way to display discrete data is to use a
stem and leaf display. Essentially, the stem is the most aggregate level of grouping
of the data, while the leaf is made up of more disaggregate data. Table 2.5 shows
some household data when the actual income was collected, rather than having people
respond to pre-defined classes.
A stem and leaf display would be constructed, for example, by using the tens of
thousands of dollars as the stem and the thousands as the leaf. This, like a histogram,
provides a picture of the distribution of the data, as shown in Figure 2.11. This graphic
shows clearly the nature of the distribution of incomes.
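The construction of such a display can be sketched in a few lines of Python. The example below uses only the first ten households of Table 2.5 to keep it short, and it assumes that incomes are rounded to the nearest thousand before being split into stem and leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1000):
    """Split each value into a stem (tens of leaf units) and a leaf (single leaf units)."""
    display = defaultdict(list)
    for value in values:
        units = (value + leaf_unit // 2) // leaf_unit  # round to the nearest leaf unit
        stem, leaf = divmod(units, 10)
        display[stem].append(leaf)
    # Order the stems, and the leaves within each stem.
    return {stem: sorted(display[stem]) for stem in sorted(display)}


# Annual incomes of households 1 to 10 from Table 2.5 (dollars).
incomes = [22358, 24679, 37455, 46223, 22790, 38656, 49999, 76450, 53744, 18919]

display = stem_and_leaf(incomes)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Note that rounding moves $49,999 into the stem for $50,000 to $59,999; truncating instead of rounding would be an equally defensible convention, provided it is stated.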
Table 2.4 Grouped temperature data

| Temperature range | Number of highs | Number of lows | Cumulative number of highs | Cumulative number of lows |
|---|---|---|---|---|
| 16–17 | 0 | 4 | 0 | 4 |
| 18–19 | 0 | 9 | 0 | 13 |
| 20–21 | 0 | 12 | 0 | 25 |
| 22–23 | 2 | 5 | 2 | 30 |
| 24–25 | 5 | 0 | 7 | 30 |
| 26–27 | 9 | 0 | 16 | 30 |
| 28–29 | 7 | 0 | 23 | 30 |
| 30–31 | 4 | 0 | 27 | 30 |
| 32–33 | 3 | 0 | 30 | 30 |

Figure 2.11 Stem and leaf display of income (stem: tens of thousands of dollars; leaf: thousands of dollars)

Stem | Leaf
0 | 3 4 4 6 6 7 9 9
1 | 2 3 3 4 6 6 8 9 9
2 | 0 0 1 2 2 3 4 4 5 6 7
3 | 1 3 4 6 7 7 8 9 9
4 | 1 4 5 6 7 7
5 | 0 4 5 5 7
6 | 6 7 8 9
7 | 0 1 2 6
8 | 9
9 | 1 6
10 | 1

Table 2.5 Disaggregate household income data

| Household number | Annual income | Household number | Annual income | Household number | Annual income |
|---|---|---|---|---|---|
| 1 | $22,358 | 21 | $9,226 | 41 | $70,135 |
| 2 | $24,679 | 22 | $96,435 | 42 | $100,563 |
| 3 | $37,455 | 23 | $55,341 | 43 | $3,877 |
| 4 | $46,223 | 24 | $89,367 | 44 | $2,954 |
| 5 | $22,790 | 25 | $12,984 | 45 | $6,422 |
| 6 | $38,656 | 26 | $21,444 | 46 | $16,351 |
| 7 | $49,999 | 27 | $36,339 | 47 | $19,222 |
| 8 | $76,450 | 28 | $20,105 | 48 | $56,778 |
| 9 | $53,744 | 29 | $44,446 | 49 | $41,237 |
| 10 | $18,919 | 30 | $34,288 | 50 | $24,892 |
| 11 | $44,881 | 31 | $25,678 | 51 | $31,084 |
| 12 | $26,570 | 32 | $4,122 | 52 | $68,008 |
| 13 | $12,135 | 33 | $7,390 | 53 | $71,039 |
| 14 | $46,990 | 34 | $65,809 | 54 | $13,133 |
| 15 | $37,855 | 35 | $47,001 | 55 | $18,259 |
| 16 | $32,568 | 36 | $23,874 | 56 | $14,249 |
| 17 | $8,917 | 37 | $39,007 | 57 | $36,898 |
| 18 | $19,772 | 38 | $67,445 | 58 | $91,045 |
| 19 | $72,455 | 39 | $54,890 | 59 | $6,341 |
| 20 | $69,078 | 40 | $22,378 | 60 | $15,887 |

Central measures of data
There are at least six different averages that can be computed, which provide different
ways of assessing the central value of the data. The six that are discussed here are:
(1) arithmetic mean;
(2) median;
(3) mode;
(4) geometric mean;
(5) harmonic mean; and
(6) quadratic mean.
The arithmetic mean The arithmetic mean is simply the total of all the
values in the data divided by the number of elements in the sample that provided valid
values for the statistic.
Mathematically, it is usually written as equation (2.3):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (2.3)$$
In words, the mean of the variable x is equal to the sum of all the values of x in the data
set, divided by the number of observations, n. It is important to note that values of x
that contribute to the estimation of the mean are only those that are valid, and that n is
also a count of the valid observations. Thus, in the income data we used previously, the
missing values would be removed, and a mean, if it was calculated, would be based on
900 observations, not on the 1,070 survey returns.
The sample mean – i.e., the value of the mean estimated from a sample of observa-
tions – is normally denoted by the symbol x̅, while the true mean from the population
is denoted by the Greek letter μ. It is a convention in statistics to use Greek letters to
denote true population values, and the equivalent Roman letter to denote the sample
estimate of that value. Put another way, the parameter is denoted by a Greek letter, and
the statistic by the equivalent Roman letter.
Using the temperature data from Table 2.3, the sum of the maximum temperatures
is found to be 825, which yields an arithmetic mean of 27.5°C. Similarly, the sum of
the minimum temperatures is 591, which gives an arithmetic mean of 19.7°C. In each
of these cases there were thirty valid observations, so the total or sum was divided by
thirty to give the arithmetic mean. Similarly, using the income data from Table 2.5, the
sum of the incomes is $2,248,437. With sixty valid observations of income, the arith-
metic mean of income is $37,474.
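The temperature means can be verified with a short Python sketch. It assumes the values are typed in from Table 2.3 and, purely as an illustrative convention, that a missing observation would be recorded as `None`, so that only valid observations enter both the sum and the count n.

```python
# Daily maximum and minimum temperatures (°C), in the order given in Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]


def arithmetic_mean(values):
    """Equation (2.3), using only valid (non-missing) observations."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


print(sum(max_temps), arithmetic_mean(max_temps))
print(sum(min_temps), round(arithmetic_mean(min_temps), 1))

# A missing value is dropped from both the numerator and the denominator.
print(arithmetic_mean([10, None, 20]))
```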
The arithmetic mean (usually referred to simply as the mean, because it is the mean
most often used) can also be understood by considering it as being the centre of gravity
of the data. This is shown in Figure 2.12. In each of the two distributions shown in the
figure, the fulcrum or balance point represents the mean. In the distribution on the left
the mean is at thirteen, while in the one on the right it is at fourteen.
Figure 2.12 illustrates two important facts. First, the symmetry or lack of it in a distri-
bution of values will affect where the mean falls. Second, the arithmetic mean is influ-
enced by extreme values. If the value of twenty were removed from the data distribution
on the right of Figure 2.12, the mean would shift to thirteen. On the other hand, if the
extreme value had been at twenty-five instead of twenty, the mean value would shift to
14.5. These changes come about by changing one out of nine observations, suggesting
some substantial sensitivity of the mean to a relatively small change in the data.
The median The median is the central value of the data, or it can be
defined to be the value for which half the data are above the value and half are below.
For any data, the median value is most easily found by ordering the data in increasing
or decreasing value and then finding the midpoint value. For the temperature data, this
is seen fairly easily in the grouped data of Table 2.4. For the maximum temperature,
the dividing point between the first fifteen values and the last fifteen values is found at
27°C, which is therefore the median value. Similarly, for the minimum temperatures,
the median is 20°C. Note that the median must be a whole number of degrees in these
cases, because the data are reported only in whole numbers of degrees. Note that the
medians of each of these two variables are not exactly the same as the means, although
they are very close.
For already grouped data, the median must be a range. Looking back at the income
data in Table 2.2, and using the cumulative proportions with the missing data excluded,
it can be seen that the median falls in the interval $36,400–$51,999. For the income
data in Table 2.5, the median can be an actual value. However, because there is an even
number of observations, the median actually falls between the thirtieth and the thirty-first
observations, so between $32,568 and $34,288. By interpolation, the median
would be $33,428. Comparing this to the mean, it is noted that the mean is quite a bit
higher at $37,474.
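The medians quoted above can be checked with Python's standard `statistics` module, which orders the data and, for an even number of observations, averages the two middle values. The sketch assumes the temperatures are typed in from Table 2.3; for the income data only the two middle observations named in the text are used.

```python
import statistics

# Daily maximum and minimum temperatures (°C), in the order given in Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]

median_high = statistics.median(max_temps)
median_low = statistics.median(min_temps)

# With sixty incomes, the median lies midway between the thirtieth and
# thirty-first ordered observations.
median_income = statistics.median([32568, 34288])

print(median_high, median_low, median_income)
```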
The mode The mode is the most frequently occurring value in a set of
observations. For the maximum temperature data, the mode occurs at 27°C, for which
there are five observations. For the minimum temperature, the mode occurs at 20°C,
for which there are eight days on which this temperature occurs. For the income data
in Table 2.2, the mode is $15,600–$25,999. This is quite different from the median. For
the income data in Table 2.5, there is no mode for the ungrouped data, because each
value is unique. To find a mode, it is necessary to group the data. This has, effectively,
been done in the stem and leaf display, from which it can be determined that the mode
is in the range of $20,000–$24,999, which contains eight households. Using classes
of $5,000 for the ranges, there is no other range that has as many households in it. If
ranges of $10,000 were used, then the mode would be in the range $20,000–$29,999.
Figure 2.12 Arithmetic mean as centre of gravity (left axis: 11 to 15; right axis: 11 to 20)
Source: Ewart, Ford, and Lin (1982: 38).
Unlike the other central measures, there may be more than one mode. In fact, the limit
on the number of modes that can occur is the number of observations, if each value
occurs only once in the data set. However, this is not a useful result, and data in which
each value occurs only once, as in the income data, should be grouped to provide more
useful information. Data may be distributed bimodally or trimodally, or more. This
means that there will be multiple peaks in the data distribution. Figure 2.13 shows a
possible bimodal distribution of daily maximum temperatures. There are two modes in
the underlying data, one at 23°C and one at 27°C. Knowing that there are two modes
in a data set provides information on the appearance of the underlying distribution, as
shown in Figure 2.13.
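Modes, including multiple modes, can be found with `statistics.multimode` (available in Python 3.8 and later), which returns every value that shares the highest frequency. The temperatures below are from Table 2.3; the short `bimodal` list is a hypothetical illustration and is not the data behind Figure 2.13.

```python
import statistics
from collections import Counter

# Daily maximum and minimum temperatures (°C), in the order given in Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]

modes_high = statistics.multimode(max_temps)
modes_low = statistics.multimode(min_temps)
print(modes_high, Counter(max_temps)[27])
print(modes_low, Counter(min_temps)[20])

# Hypothetical data with two equally frequent values.
bimodal = [22, 23, 23, 23, 24, 25, 26, 27, 27, 27, 28]
print(statistics.multimode(bimodal))

# When every value is unique, every value is returned as a mode.
print(statistics.multimode([101, 102, 103]))
```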
The geometric mean The geometric mean is similar to the arithmetic
mean, except that it is determined from the product of all the values, not the sum, and
the nth root of the product is taken, rather than dividing by n. Thus, the geometric mean
is written as shown in equation (2.4):
$$\bar{x}_g = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \qquad (2.4)$$
It is most useful when looking at growth over time periods. For example, suppose an
individual had investments in a mutual fund over a period of twelve years, and the
fund experienced the growth rates shown in Table 2.6. The question one might like to
ask is: ‘What is the average annual growth rate over the twelve years?’ If one were to
estimate this using the arithmetic mean, one would obtain the answer that the average
growth rate is 5.85 per cent. However, the geometric mean produces a value of 5.77
per cent. Although this difference does not appear to be numerically large, it has a sig-
nificant effect on calculations of the value of the investment at the end of twelve years.
If one assumes that the actual initial investment was $10,000, then the actual fund
would stand at $19,609.82. This is exactly the result that would be obtained by using
the geometric mean. However, the arithmetic mean would estimate the fund as being
$19,782.92 – a difference of $173.10.
Figure 2.13 Bimodal distribution of temperatures (X-axis: maximum daily temperatures, 20 to 34; Y-axis: days of occurrence)
The arithmetic mean is obtained from equation (2.5):
$$\bar{x} = \frac{1.052 + 1.067 + 1.103 + 1.139 + 1.116 + 1.065 + 1.059 + 1.038 + 1.002 + 1.021 + 1.016 + 1.024}{12} = \frac{12.702}{12} = 1.0585 \qquad (2.5)$$
This produces an estimated annual average growth rate of 5.85 per cent. Using this
to estimate the actual value of the fund at the end of twelve years, assuming an initial
investment of $10,000, one would calculate equation (2.6):
$$V_{12} = \$10{,}000 \times (1.0585)^{12} = \$10{,}000 \times 1.978292 = \$19{,}782.92 \qquad (2.6)$$
The geometric mean is obtained from equation (2.7):
$$\bar{x}_g = (1.052 \times 1.067 \times 1.103 \times 1.139 \times 1.116 \times 1.065 \times 1.059 \times 1.038 \times 1.002 \times 1.021 \times 1.016 \times 1.024)^{1/12} = (1.96098)^{1/12} = 1.0577 \qquad (2.7)$$
This produces the estimated annual geometric mean growth rate of 5.77 per cent. To
estimate the value of the fund at the end of twelve years, one estimates in the same
manner as for the arithmetic mean, as in equation (2.8):
$$V_{12} = \$10{,}000 \times (1.0577)^{12} = \$10{,}000 \times 1.960982 = \$19{,}609.82 \qquad (2.8)$$
The reader can readily verify that this is identical to the amount calculated by apply-
ing each year’s growth rate, compounded, to the amount of the fund at the end of the
previous year. Note that both the arithmetic and geometric means are obtained by using
the compounding formula (1 + growth rate) to obtain the average rate of growth.

Table 2.6 Growth rates of an investment fund, 1993–2004

| Year | Growth (percentage change) |
|---|---|
| 1993 | 5.20 |
| 1994 | 6.70 |
| 1995 | 10.30 |
| 1996 | 13.90 |
| 1997 | 11.60 |
| 1998 | 6.50 |
| 1999 | 5.90 |
| 2000 | 3.80 |
| 2001 | 0.20 |
| 2002 | 2.10 |
| 2003 | 1.60 |
| 2004 | 2.40 |
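The comparison between the two means can be reproduced numerically. The following Python sketch assumes the growth rates are typed in from Table 2.6 and an initial investment of $10,000, as in the text; the variable names are illustrative. It works with the unrounded geometric mean, which is why compounding it for twelve years returns the actual fund value.

```python
import math

# Annual growth rates (per cent), 1993 to 2004, from Table 2.6.
growth_pct = [5.2, 6.7, 10.3, 13.9, 11.6, 6.5, 5.9, 3.8, 0.2, 2.1, 1.6, 2.4]
factors = [1 + g / 100 for g in growth_pct]  # compounding factors (1 + growth rate)
n = len(factors)
initial = 10_000

arith = sum(factors) / n            # equation (2.5)
geom = math.prod(factors) ** (1 / n)  # equation (2.7)

actual = initial * math.prod(factors)  # compounding each year's rate in turn
from_geom = initial * geom ** n        # equation (2.8), unrounded mean
from_arith = initial * arith ** n      # equation (2.6)

print(f"arithmetic mean factor {arith:.4f}, geometric mean factor {geom:.4f}")
print(f"actual {actual:.2f}, geometric {from_geom:.2f}, arithmetic {from_arith:.2f}")
```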
The harmonic mean The harmonic mean is obtained by summing the
inverse of the values for each observation, taking the inverse of this value, and multi-
plying the result by the number of observations. It may be written as shown in equa-
tion (2.9):
$$\bar{x}_h = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \qquad (2.9)$$
Table 2.7 Speeds by kilometre for a train

| Kilometre of trip | Speed (km/h) | Time taken (minutes) |
|---|---|---|
| 1 | 40 | 1.5 |
| 2 | 45 | 1.333 |
| 3 | 55 | 1.091 |
| 4 | 60 | 1 |
| 5 | 70 | 0.857 |
| 6 | 65 | 0.923 |
| 7 | 50 | 1.2 |
| 8 | 35 | 1.714 |
| 9 | 40 | 1.5 |
| 10 | 60 | 1 |
| 11 | 70 | 0.857 |
| 12 | 80 | 0.75 |
| 13 | 100 | 0.6 |
| 14 | 90 | 0.667 |
| 15 | 70 | 0.857 |
| 16 | 60 | 1 |
| 17 | 60 | 1 |
| 18 | 45 | 1.333 |
| 19 | 40 | 1.5 |
| 20 | 30 | 2 |
| Total | – | 22.683 |

The harmonic mean is used to estimate a mean from rates such as rates by time or
distance. A good example would be provided by estimating the average speed of a
train when the train’s speed changes every one kilometre, because of track condition,
signals, and congestion. Suppose that the speeds for each kilometre of a twenty kilometre
train trip were as shown in Table 2.7. If one were to take the arithmetic mean,
this would give a mean speed of 58.25 kilometres per hour (km/h). This would suggest
that the time taken for the trip was 20 × 60 / 58.25 minutes, or 20.6 minutes, when it
was actually 22.7 minutes (see Table 2.7). The harmonic mean is calculated as shown
in equation (2.10):
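$$\bar{x}_h = \frac{20}{\frac{1}{40} + \frac{1}{45} + \frac{1}{55} + \frac{1}{60} + \frac{1}{70} + \frac{1}{65} + \frac{1}{50} + \frac{1}{35} + \frac{1}{40} + \frac{1}{60} + \frac{1}{70} + \frac{1}{80} + \frac{1}{100} + \frac{1}{90} + \frac{1}{70} + \frac{1}{60} + \frac{1}{60} + \frac{1}{45} + \frac{1}{40} + \frac{1}{30}} = \frac{20}{0.37805} = 52.903 \qquad (2.10)$$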
This gives a harmonic mean speed of 52.903 km/h. Using this figure, rather than the
arithmetic mean speed, the time taken for the twenty kilometre trip is 20 × 60 / 52.903
minutes, or 22.7 minutes, which is the correct figure.
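The train example can be checked with the `harmonic_mean` function in Python's standard `statistics` module. The sketch assumes the speeds are typed in from Table 2.7, one value per kilometre; the variable names are illustrative.

```python
import statistics

# Speed (km/h) over each of the twenty one-kilometre sections, from Table 2.7.
speeds = [40, 45, 55, 60, 70, 65, 50, 35, 40, 60,
          70, 80, 100, 90, 70, 60, 60, 45, 40, 30]

arith = statistics.mean(speeds)
harm = statistics.harmonic_mean(speeds)  # equation (2.9)

# Actual travel time: each kilometre takes 60 / speed minutes.
trip_minutes = sum(60 / s for s in speeds)

# Travel time implied by each mean speed over the twenty kilometres.
minutes_from_arith = len(speeds) * 60 / arith
minutes_from_harm = len(speeds) * 60 / harm

print(f"arithmetic mean {arith:.2f} km/h implies {minutes_from_arith:.1f} minutes")
print(f"harmonic mean {harm:.3f} km/h implies {minutes_from_harm:.1f} minutes")
print(f"actual time {trip_minutes:.3f} minutes")
```

The harmonic mean is the appropriate average here because each speed applies over an equal distance, not an equal time.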
The quadratic mean
The quadratic mean is also known as the root mean square (RMS). It is given by summing
the squared values of the observations, dividing this sum by the number of observations,
and taking the square root of the result, as shown in equation (2.11):

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} x_i^{2}}{n}} \qquad (2.11)$$
The quadratic mean is most often used with data whose arithmetic mean is zero. It
is often used for estimating error when the expected value of the average error is zero.
For example, suppose that one is assessing the accuracy of a machine that produces
ball bearings of nominally 100 millimetres (mm) in diameter. Measurements are taken
of a number of ball bearings, and the actual diameters found to be those shown in
Table 2.8, which also shows how much each one deviates from 100 mm.
The arithmetic mean of the deviations is −0.11 mm. However, the RMS is ±0.81
mm. The latter value gives a much clearer idea of the amount by which the ball bearings
actually deviate from the desired diameter, because it does not allow the negative
values to compensate for the positive ones. It shows, more precisely, the tolerance in
the manufacturing process.
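The ball-bearing figures can be reproduced with a short calculation. The sketch below is a minimal Python example that uses the diameters from Table 2.8; the variable names are illustrative and are not taken from the text.

```python
import math

# Measured diameters (mm) of the ten ball bearings in Table 2.8.
diameters = [98.5, 100.2, 99.6, 98.9, 100.6, 100.3, 100.7, 99.1, 99.9, 101.1]
nominal = 100.0  # nominal diameter in mm

# Deviation of each bearing from the nominal diameter.
deviations = [d - nominal for d in diameters]

# Arithmetic mean: positive and negative deviations offset each other.
mean_deviation = sum(deviations) / len(deviations)

# Quadratic mean (RMS), equation (2.11): squaring removes the signs first.
rms = math.sqrt(sum(x * x for x in deviations) / len(deviations))

print("mean deviation (mm):", round(mean_deviation, 2))
print("RMS deviation (mm):", round(rms, 2))
```

The mean deviation is small in magnitude because the signed deviations largely cancel, while the RMS of about 0.81 mm reflects their typical size.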
Relationships between mean (arithmetic), median, and mode
There are relationships between the arithmetic mean (referred to hereafter as the mean), the
median, and the mode that can tell us more about the underlying data. In the temperature
data from Table 2.3, it was found that the mean high temperature was 27.5°C, the
median was 27°C, and the mode occurred at 27°C. In this case, it can be seen that the
mode, median, and mean are all quite close. For the low temperatures, the mean was
19.7°C, the median was 20°C, and the mode was also 20°C. Again, the values are very
similar. In contrast, for the income data of Table 2.5, the mean is $37,474, the median
is $33,428, and the mode would be in the range $20,000–$24,999. These values are
not particularly close.
For the mean, mode, and median to be the same value, the data must be distributed
symmetrically around the mean and median, and the distribution must be unimodal
(i.e., have one mode), with that mode occurring at the mean value.
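The condition can be illustrated with two small data sets. The following minimal Python sketch uses the standard-library statistics module; both data sets are hypothetical, invented purely for illustration, and are not the temperature or income data discussed in the text.

```python
import statistics

# Hypothetical data, invented for illustration only.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]        # symmetric, single peak at 3
right_skewed = [1, 2, 2, 2, 3, 3, 4, 6, 13]    # long tail towards high values


def summarise(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))


sym_mean, sym_median, sym_mode = summarise(symmetric)
skew_mean, skew_median, skew_mode = summarise(right_skewed)

print("symmetric:    mean, median, mode =", sym_mean, sym_median, sym_mode)
print("right-skewed: mean, median, mode =", skew_mean, skew_median, skew_mode)
```

In the symmetric, unimodal set the three measures coincide, while in the set with a long upper tail the mean is pulled above the median, which in turn lies above the mode.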
Plotting the temperature data, as shown in Figure 2.14 for the high temperatures
and Figure 2.15 for the low temperatures, shows distributions that are very nearly
symmetrical and that meet the conditions for a coincidence of mean, mode, and
median.
Using Sturges’ rule, with sixty observations on income, incomes should be grouped
into seven equal steps. This can be done by setting the intervals to $15,000. The result
is shown in Figure 2.16. In contrast to the temperature data, Figure 2.16 shows that
the data are not symmetrical but, rather, that they are skewed to the right, meaning
that there is a longer tail to the distribution to the right than to the left. This leads to a
median and a mode that are both below the mean.
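The class count quoted above can be checked with Sturges' rule in its usual form, k = 1 + log2(n). The Python sketch below is a minimal example under two assumptions that are not stated in the text: the result is rounded up to a whole number of classes, and the lowest class starts at $0. The function name is illustrative.

```python
import math


def sturges_classes(n):
    """Number of classes from Sturges' rule, k = 1 + log2(n).

    Rounding up to the next whole number is an assumption made here.
    """
    return math.ceil(1 + math.log2(n))


n_observations = 60
k = sturges_classes(n_observations)

# Class boundaries for equal intervals of $15,000.
# Starting the first class at $0 is an assumption for illustration.
interval = 15000
edges = [i * interval for i in range(k + 1)]

print("classes:", k)
print("class boundaries:", edges)
```

With sixty observations, 1 + log2(60) is a little under 7, which gives the seven classes used for Figure 2.16.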
Table 2.8 Measurements of ball bearings

Ball bearing   Diameter (mm)   Deviation (mm)
1              98.5            −1.5
2              100.2           0.2
3              99.6            −0.4
4              98.9            −1.1
5              100.6           0.6
6              100.3           0.3
7              100.7           0.7
8              99.1            −0.9
9              99.9            −0.1
10             101.1           1.1
[Figure 2.14 Distribution of maximum temperatures from Table 2.4 (horizontal axis: maximum temperature, 22 to 33; vertical axis: frequency, 0 to 5)]
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys
Collecting managing and assessing data using sample surveys

Collecting managing and assessing data using sample surveys

  • 1.
  • 2. Collecting, Managing, and Assessing Data Using Sample Surveys
      Collecting, Managing, and Assessing Data Using Sample Surveys provides a thorough, step-by-step guide to the design and implementation of surveys. Beginning with a primer on basic statistics, the first half of the book takes readers on a comprehensive tour through the basics of survey design. Topics covered include the ethics of surveys, the design of survey procedures, the design of the survey instrument, how to write questions, and how to draw representative samples. Having shown readers how to design surveys, the second half of the book discusses a number of issues surrounding their implementation, including repetitive surveys, the economics of surveys, Web-based surveys, coding and data entry, data expansion and weighting, the issue of nonresponse, and the documenting and archiving of survey data. The book is an excellent introduction to the use of surveys for graduate students as well as a useful reference work for scholars and professionals.
      Peter Stopher is Professor of Transport Planning at the Institute of Transport and Logistics Studies at the University of Sydney. He has also been a professor at Northwestern University, Cornell University, McMaster University, and Louisiana State University. Professor Stopher has developed a substantial reputation in the field of data collection, particularly for the support of travel forecasting and analysis. He pioneered the development of travel and activity diaries as a data collection mechanism, and has written extensively on issues of sample design, data expansion, nonresponse biases, and measurement issues.
  • 3.
  • 4. Collecting, Managing, and Assessing Data Using Sample Surveys
      Peter Stopher
  • 5. CAMBRIDGE UNIVERSITY PRESS
      Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City
      Cambridge University Press
      The Edinburgh Building, Cambridge CB2 8RU, UK
      Published in the United States of America by Cambridge University Press, New York
      www.cambridge.org
      Information on this title: www.cambridge.org/9780521681872
      © Peter Stopher 2012
      This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
      First published 2012
      Printed in the United Kingdom at the University Press, Cambridge
      A catalogue record for this publication is available from the British Library
      ISBN 978-0-521-86311-7 Hardback
      ISBN 978-0-521-68187-2 Paperback
      Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
  • 6.  To my wife, Carmen, with grateful thanks for your faith in me and your continuing support and encouragement.
  • 7.
  • 8. Contents
      List of figures xix
      List of tables xxii
      Acknowledgements xxv
      1 Introduction 1
        1.1 The purpose of this book 1
        1.2 Scope of the book 2
        1.3 Survey statistics 4
      2 Basic statistics and probability 6
        2.1 Some definitions in statistics 6
          2.1.1 Censuses and surveys 7
        2.2 Describing data 8
          2.2.1 Types of scales 8
            Nominal scales 8
            Ordinal scales 9
            Interval scales 9
            Ratio scales 10
            Measurement scales 10
          2.2.2 Data presentation: graphics 11
          2.2.3 Data presentation: non-graphical 16
            Measures of magnitude 17
            Frequencies and proportions 17
            Central measures of data 21
            Measures of dispersion 34
            The normal distribution 45
            Some useful properties of variances and standard deviations 46
            Proportions or probabilities 47
            Data transformations 48
            Covariance and correlation 50
            Coefficient of variation 51
  • 9.
            Other measures of variability 53
            Alternatives to Sturges’ rule 62
      3 Basic issues in surveys 64
        3.1 Need for survey methods 64
          3.1.1 A definition of sampling methodology 65
        3.2 Surveys and censuses 65
          3.2.1 Costs 66
          3.2.2 Time 67
        3.3 Representativeness 68
          3.3.1 Randomness 69
          3.3.2 Probability sampling 70
            Sources of random numbers 71
        3.4 Errors and bias 71
          3.4.1 Sample design and sampling error 73
          3.4.2 Bias 74
          3.4.3 Avoiding bias 78
        3.5 Some important definitions 78
      4 Ethics of surveys of human populations 81
        4.1 Why ethics? 81
        4.2 Codes of ethics or practice 82
        4.3 Potential threats to confidentiality 84
          4.3.1 Retaining detail and confidentiality 85
        4.4 Informed consent 86
        4.5 Conclusions 89
      5 Designing a survey 91
        5.1 Components of survey design 91
        5.2 Defining the survey purpose 93
          5.2.1 Components of survey purpose 94
            Data needs 94
            Comparability or innovation 97
            Defining data needs 99
            Data needs in human subject surveys 99
            Survey timing 100
            Geographic bounds for the survey 101
        5.3 Trade-offs in survey design 102
      6 Methods for conducting surveys of human populations 104
        6.1 Overview 104
        6.2 Face-to-face interviews 105
        6.3 Postal surveys 107
  • 10.
        6.4 Telephone surveys 108
        6.5 Internet surveys 111
        6.6 Compound survey methods 112
          6.6.1 Pre-recruitment contact 112
          6.6.2 Recruitment 113
            Random digit dialling 115
          6.6.3 Survey delivery 117
          6.6.4 Data collection 118
          6.6.5 An example 119
        6.7 Mixed-mode surveys 120
          6.7.1 Increasing response and reducing bias 123
        6.8 Observational surveys 125
      7 Focus groups 127
        7.1 Introduction 127
        7.2 Definition of a focus group 128
          7.2.1 The size and number of focus groups 128
          7.2.2 How a focus group functions 129
          7.2.3 Analysing the focus group discussions 131
          7.2.4 Some disadvantages of focus groups 131
        7.3 Using focus groups to design a survey 132
        7.4 Using focus groups to evaluate a survey 134
        7.5 Summary 135
      8 Design of survey instruments 137
        8.1 Scope of this chapter 137
        8.2 Question type 137
          8.2.1 Classification and behaviour questions 138
            Mitigating threatening questions 139
          8.2.2 Memory or recall error 142
        8.3 Question format 145
          8.3.1 Open questions 145
          8.3.2 Field-coded questions 146
          8.3.3 Closed questions 147
        8.4 Physical layout of the survey instrument 150
          8.4.1 Introduction 150
          8.4.2 Question ordering 153
            Opening questions 153
            Body of the survey 154
            The end of the questionnaire 158
          8.4.3 Some general issues on question layout 159
            Overall format 160
  • 11.
            Appearance of the survey 161
            Front cover 162
            Spatial layout 163
            Choice of typeface 164
            Use of colour and graphics 166
            Question numbering 169
            Page breaks 170
            Repeated questions 171
            Instructions 172
            Show cards 174
            Time of the interview 174
            Precoding 174
            End of the survey 175
            Some final comments on questionnaire layout 176
      9 Design of questions and question wording 177
        9.1 Introduction 177
        9.2 Issues in writing questions 178
          9.2.1 Requiring an answer 178
          9.2.2 Ready answers 180
          9.2.3 Accurate recall and reporting 181
          9.2.4 Revealing the data 182
          9.2.5 Motivation to answer 183
          9.2.6 Influences on response categories 184
          9.2.7 Use of categories and other responses 185
            Ordered and unordered categories 187
        9.3 Principles for writing questions 188
          9.3.1 Use simple language 189
          9.3.2 Number of words 190
          9.3.3 Avoid using vague words 191
          9.3.4 Avoid using ‘Tick all that apply’ formats 193
          9.3.5 Develop response categories that are mutually exclusive and exhaustive 193
          9.3.6 Make sure that questions are technically correct 195
          9.3.7 Do not ask respondents to say ‘Yes’ in order to say ‘No’ 196
          9.3.8 Avoid double-barrelled questions 196
        9.4 Conclusion 197
      10 Special issues for qualitative and preference surveys 199
        10.1 Introduction 199
        10.2 Designing qualitative questions 199
          10.2.1 Scaling questions 200
  • 12.
        10.3 Stated response questions 206
          10.3.1 The hypothetical situation 206
          10.3.2 Determining attribute levels 207
          10.3.3 Number of choice alternatives or scenarios 207
          10.3.4 Other issues of concern 208
            Data inconsistency 208
            Lexicographic responses 209
            Random responses 209
        10.4 Some concluding comments on stated response survey design 210
      11 Design of data collection procedures 211
        11.1 Introduction 211
        11.2 Contacting respondents 211
          11.2.1 Pre-notification contacts 211
          11.2.2 Number and type of contacts 213
            Nature of reminder contacts 213
            Postal surveys 215
            Postal surveys with telephone recruitment 216
            Telephone interviews 217
            Face-to-face interviews 219
            Internet surveys 220
        11.3 Who should respond to the survey? 221
          11.3.1 Targeted person 221
          11.3.2 Full household surveys 223
            Proxy reporting 224
        11.4 Defining a complete response 225
          11.4.1 Completeness of the data items 226
          11.4.2 Completeness of aggregate sampling units 228
        11.5 Sample replacement 229
          11.5.1 When to replace a sample unit 229
          11.5.2 How to replace a sample 233
        11.6 Incentives 235
          11.6.1 Recommendations on incentives 236
        11.7 Respondent burden 240
          11.7.1 Past experience 241
          11.7.2 Appropriate moment 242
          11.7.3 Perceived relevance 242
          11.7.4 Difficulty 243
            Physical difficulty 243
            Intellectual difficulty 244
            Emotional difficulty 245
            Reducing difficulty 246
  • 13.
          11.7.5 External factors 246
            Attitudes and opinions of others 246
            The ‘feel good’ effect 247
            Appropriateness of the medium 248
          11.7.6 Mitigating respondent burden 248
        11.8 Concluding comments 250
      12 Pilot surveys and pretests 251
        12.1 Introduction 251
        12.2 Definitions 252
        12.3 Selecting respondents for pretests and pilot surveys 255
          12.3.1 Selecting respondents 255
          12.3.2 Sample size 258
            Pilot surveys 258
            Pretests 261
        12.4 Costs and time requirements of pretests and pilot surveys 262
        12.5 Concluding comments 264
      13 Sample design and sampling 265
        13.1 Introduction 265
        13.2 Sampling frames 266
        13.3 Random sampling procedures 268
          13.3.1 Initial considerations 268
          13.3.2 The normal law of error 269
        13.4 Random sampling methods 270
          13.4.1 Simple random sampling 271
            Drawing the sample 271
            Estimating population statistics and sampling errors 273
            Example 276
            Sampling from a finite population 279
            Sampling error of ratios and proportions 279
            Defining the sample size 281
            Examples 283
          13.4.2 Stratified sampling 285
            Types of stratified samples 285
            Study domains and strata 287
            Weighted means and variances 287
            Stratified sampling with a uniform sampling fraction 289
            Drawing the sample 289
            Estimating population statistics and sampling errors 290
            Pre- and post-stratification 291
            Example 293
  • 14.
            Equal allocation 294
            Summary of proportionate sampling 295
            Stratified sampling with variable sampling fraction 295
            Drawing the sample 295
            Estimating population statistics and sampling errors 296
            Non-coincident study domains and strata 296
            Optimum allocation and economic design 297
            Example 298
            Survey costs differing by stratum 300
            Example 301
            Practical issues in drawing disproportionate samples 303
            Concluding comments on disproportionate sampling 305
          13.4.3 Multistage sampling 305
            Drawing a multistage sample 306
            Requirements for multistage sampling 307
            Estimating population values and sampling statistics 308
            Example 309
            Concluding comments on multistage sampling 314
        13.5 Quasi-random sampling methods 314
          13.5.1 Cluster sampling 316
            Equal clusters: population values and standard errors 317
            Example 319
            The effects of clustering 321
            Unequal clusters: population values and standard errors 322
            Random selection of unequal clusters 324
            Example 325
            Stratified sampling of unequal clusters 326
            Paired selection of unequal-sized clusters 327
          13.5.2 Systematic sampling 328
            Population values and standard errors in a systematic sample 328
            Simple random model 329
            Stratified random model 329
            Paired selection model 329
            Successive difference model 330
            Example 330
          13.5.3 Choice-based sampling 333
        13.6 Non-random sampling methods 334
          13.6.1 Quota sampling 334
          13.6.2 Intentional, judgemental, or expert samples 335
          13.6.3 Haphazard samples 335
          13.6.4 Convenience samples 336
        13.7 Summary 336
  • 15.
      14 Repetitive surveys 337
        14.1 Introduction 337
        14.2 Non-overlapping samples 338
        14.3 Incomplete overlap 339
        14.4 Subsampling on the second and subsequent occasions 341
        14.5 Complete overlap: a panel 342
        14.6 Practical issues in designing and conducting panel surveys 343
          14.6.1 Attrition 344
            Replacement of panel members lost by attrition 345
            Reducing losses due to attrition 346
          14.6.2 Contamination 347
          14.6.3 Conditioning 348
        14.7 Advantages and disadvantages of panels 348
        14.8 Methods for administering practical panel surveys 349
        14.9 Continuous surveys 352
      15 Survey economics 356
        15.1 Introduction 356
        15.2 Cost elements in survey design 357
        15.3 Trade-offs in survey design 359
          15.3.1 Postal surveys 360
          15.3.2 Telephone recruitment with a postal survey with or without telephone retrieval 361
          15.3.3 Face-to-face interview 362
          15.3.4 More on potential trade-offs 362
        15.4 Concluding comments 363
      16 Survey implementation 365
        16.1 Introduction 365
        16.2 Interviewer selection and training 365
          16.2.1 Interviewer selection 365
          16.2.2 Interviewer training 368
          16.2.3 Interviewer monitoring 369
        16.3 Record keeping 370
        16.4 Survey supervision 372
        16.5 Survey publicity 373
          16.5.1 Frequently asked questions, fact sheet, or brochure 374
        16.6 Storage of survey forms 374
          16.6.1 Identification numbers 375
        16.7 Issues for surveys using posted materials 377
        16.8 Issues for surveys using telephone contact 377
          16.8.1 Caller ID 378
          16.8.2 Answering machines 378
  • 16.
          16.8.3 Repeated requests for callback 380
        16.9 Data on incomplete responses 381
        16.10 Checking survey responses 382
        16.11 Times to avoid data collection 383
        16.12 Summary comments on survey implementation 383
      17 Web-based surveys 385
        17.1 Introduction 385
        17.2 The internet as an optional response mechanism 388
        17.3 Some design issues for Web surveys 389
          17.3.1 Differences between paper and internet surveys 389
          17.3.2 Question and response 390
          17.3.3 Ability to fill in the Web survey in multiple sittings 392
          17.3.4 Progress tracking 393
          17.3.5 Pre-filled responses 394
          17.3.6 Confidentiality in Web-based surveys 395
          17.3.7 Pictures, maps, etc. on Web surveys 395
            Animation in survey pictures and maps 396
          17.3.8 Browser software 396
            User interface design 396
            Creating mock-ups 397
            Page loading time 398
        17.4 Some design principles for Web surveys 398
        17.5 Concluding comments 399
      18 Coding and data entry 401
        18.1 Introduction 401
        18.2 Coding 402
          18.2.1 Coding of missing values 402
          18.2.2 Use of zeros and blanks in coding 403
          18.2.3 Coding consistency 404
            Binary variables 404
            Numeric variables 404
          18.2.4 Coding complex variables 405
          18.2.5 Geocoding 406
            Requesting address details for other places than home 408
            Pre-coding of buildings 409
            Interactive gazetteers 410
            Other forms of geocoding assistance 410
            Locating by mapping software 411
          18.2.6 Methods for creating codes 412
        18.3 Data entry 413
        18.4 Data repair 416
  • 17. Contents
    19 Data expansion and weighting
      19.1 Introduction
      19.2 Data expansion
        19.2.1 Simple random sampling
        19.2.2 Stratified sampling
        19.2.3 Multistage sampling
        19.2.4 Cluster samples
        19.2.5 Other sampling methods
      19.3 Data weighting
        19.3.1 Weighting with unknown population totals
          An example
          A second example
        19.3.2 Weighting with known populations
          An example
      19.4 Summary
    20 Nonresponse
      20.1 Introduction
      20.2 Unit nonresponse
        20.2.1 Calculating response rates
          Classifying responses to a survey
          Calculating response rates
        20.2.2 Reducing nonresponse and increasing response rates
          Design issues affecting nonresponse
          Survey publicity
          Use of incentives
          Use of reminders and repeat contacts
          Personalisation
          Summary
        20.2.3 Nonresponse surveys
      20.3 Item nonresponse
        20.3.1 Data repair
          Flagging repaired variables
          Inference
          Imputation
          Historical imputation
          Average imputation
          Ratio imputation
          Regression imputation
          Cold-deck imputation
          Hot-deck imputation
          Expectation maximisation
  • 18. Contents
          Multiple imputation
          Imputation using neural networks
          Summary of imputation methods
        20.3.2 A final note on item nonresponse
          Strategies to obtain age and income
          Age
          Income
    21 Measuring data quality
      21.1 Introduction
      21.2 General measures of data quality
        21.2.1 Missing value statistic
        21.2.2 Data cleaning statistic
        21.2.3 Coverage error
        21.2.4 Sample bias
      21.3 Specific measures of data quality
        21.3.1 Non-mobility rates
        21.3.2 Trip rates and activity rates
        21.3.3 Proxy reporting
      21.4 Validation surveys
        21.4.1 Follow-up questions
        21.4.2 Independent measurement
      21.5 Adherence to quality measures and guidance
    22 Future directions in survey procedures
      22.1 Dangers of forecasting new directions
      22.2 Some current issues
        22.2.1 Reliance on telephones
          Threats to the use of telephone surveys
          Conclusions on reliance on telephones
        22.2.2 Language and literacy
          Language
          Literacy
        22.2.3 Mixed-mode surveys
        22.2.4 Use of administrative data
        22.2.5 Proxy reporting
      22.3 Some possible future directions
        22.3.1 A GPS survey as a potential substitute for a household travel survey
          The effect of multiple observations of each respondent on sample size
  • 19. Contents
    23 Documenting and archiving
      23.1 Introduction
      23.2 Documentation or the creation of metadata
        23.2.1 Descriptive metadata
        23.2.2 Preservation metadata
        23.2.3 Geospatial metadata
      23.3 Archiving of data
    References
    Index
  • 20. Figures
    2.1 Scatter plot of odometer reading versus model year
    2.2 Scatter plot of fuel type by body type
    2.3 Pie chart of vehicle body types
    2.4 Pie chart of household income groups
    2.5 Histogram of household income
    2.6 Histogram of vehicle types
    2.7 Line graph of maximum and minimum temperatures for thirty days
    2.8 Ogive of cumulative household income data from Figure 2.5
    2.9 Relative ogive of household income
    2.10 Relative step chart of household income
    2.11 Stem and leaf display of income
    2.12 Arithmetic mean as centre of gravity
    2.13 Bimodal distribution of temperatures
    2.14 Distribution of maximum temperatures from Table 2.4
    2.15 Distribution of minimum temperatures from Table 2.4
    2.16 Income distribution from Table 2.5
    2.17 Distribution of vehicle counts
    2.18 Box and whisker plot of income data from Table 2.5
    2.19 Box and whisker plot of maximum temperatures
    2.20 Box and whisker plot of minimum temperatures
    2.21 Box and whisker plot of vehicles passing through the green phase
    2.22 Box and whisker plot of children’s ages
    2.23 The normal distribution
    2.24 Comparison of normal distributions with different variances
    2.25 Scatter plot of maximum versus minimum temperature
    2.26 A distribution skewed to the right
    2.27 A distribution skewed to the left
    2.28 Distribution with low kurtosis
    2.29 Distribution with high kurtosis
    3.1 Extract of random numbers from the RAND Million Random Digits
    4.1 Example of a consent form
  • 21. List of figures
    4.2 First page of an example subject information sheet
    4.3 Second page of the example subject information sheet
    5.1 Schematic of the survey process
    5.2 Survey design trade-offs
    6.1 Schematic of survey methods
    8.1 Document file layout for booklet printing
    8.2 Example of an unacceptable questionnaire format
    8.3 Example of an acceptable questionnaire format
    8.4 Excerpt from a survey showing arrows to guide respondent
    8.5 Extract from a questionnaire showing use of graphics
    8.6 Columned layout for asking identical questions about multiple people
    8.7 Inefficient and efficient structures for organising serial questions
    8.8 Instructions placed at the point to which they refer
    8.9 Example of an unacceptable questionnaire format with response codes
    9.1 Example of a sequence of questions that do not require answers
    9.2 Example of a sequence of questions that do require answers
    9.3 Example of a belief question
    9.4 Example of a belief question with a more vague response
    9.5 Two alternative response category sets for the age question
    9.6 Alternative questions on age
    9.7 Examples of questions with unordered response categories
    9.8 An example of mixed ordered and unordered categories
    9.9 Reformulated question from Figure 9.8
    9.10 An unordered alternative to the question in Figure 9.8
    9.11 Avoiding vague words in question wording
    9.12 Example of a failure to achieve mutual exclusivity and exhaustiveness
    9.13 Correction to mutual exclusivity and exhaustiveness
    9.14 Example of a double negative
    9.15 Example of removal of a double negative
    9.16 An alternative that keeps the wording of the measure
    9.17 An alternative way to deal with a double-barrelled question
    10.1 Example of a qualitative question
    10.2 Example of a qualitative question using number categories
    10.3 Example of unbalanced positive and negative categories
    10.4 Example of balanced positive and negative categories
    10.5 Example of placing the neutral option at the end
    10.6 Example of distinguishing the neutral option from ‘No opinion’
    10.7 Use of columned layout for repeated category responses
    10.8 Alternative layout for repeated category responses
    10.9 Statements that call for similar responses
    10.10 Statements that call for varying responses
    10.11 Rephrasing questions to remove requirement for ‘Agree’/‘Disagree’
    11.1 Example of a postcard reminder for the first reminder
  • 22. List of figures
    11.2 Framework for understanding respondent burden
    14.1 Schematic of the four types of repetitive samples
    14.2 Rotating panel showing recruitment, attrition, and rotation
    18.1 An unordered set of responses requiring coding
    18.2 A possible format for asking for an address
    18.3 Excerpt from a mark-sensing survey
    20.1 Illustration of the categorisation of response outcomes
    20.2 Representation of a neural network model
    23.1 Open archival information system model
  • 23. Tables
    2.1 Frequencies and proportions of vehicle types
    2.2 Frequencies, proportions, and cumulative values for household income
    2.3 Minimum and maximum temperatures for a month (°C)
    2.4 Grouped temperature data
    2.5 Disaggregate household income data
    2.6 Growth rates of an investment fund, 1993–2004
    2.7 Speeds by kilometre for a train
    2.8 Measurements of ball bearings
    2.9 Number of vehicles passing through the green phase of a traffic light
    2.10 Sorted number of vehicles passing through the green phase
    2.11 Number of children by age
    2.12 Deviations from the mean for the income data of Table 2.5
    2.13 Outcomes from throwing the die twice
    2.14 Sorted number of vehicles passing through the green phase
    2.15 Deviations for vehicles passing through the green phase
    2.16 Values of variance and standard deviation for values of p and q
    2.17 Deviations for vehicles passing through the green phase raised to third and fourth powers
    2.18 Deviations from the mean for children’s ages
    2.19 Data on household size, annual income, and number of vehicles for forty households
    2.20 Deviations needed for covariance and correlation estimates
    3.1 Heights of 100 (fictitious) university students (cm)
    3.2 Sample of the first and last five students
    3.3 Sample of the first ten students
    3.4 Intentional sample of ten students
    3.5 Random sample of ten students (in order drawn)
    3.6 Summary of results from Tables 3.2 to 3.5
    6.1 Internet world usage statistics
  • 24. List of tables
    6.2 Mixed-mode survey types (based on Dillman and Tarnai, 1991)
    11.1 Selection grid by age and gender
    13.1 Partial listing of households for a simple random sample
    13.2 Excerpt of random numbers from the RAND Million Random Digits
    13.3 Selection of sample of 100 members using four-digit groups from Table 13.2
    13.4 Data from twenty respondents in a fictitious survey
    13.5 Sums of squares for population groups
    13.6 Data for drawing an optimum household travel survey sample
    13.7 Optimal allocation of the 2,000-household sample
    13.8 Optimal allocation and expected sampling errors by stratum
    13.9 Results of equal allocation for the household travel survey
    13.10 Given information for economic design of the optimal allocation
    13.11 Preliminary sample sizes and costs for economic design of the optimum allocation
    13.12 Estimation of the final sample size and budget
    13.13 Comparison of optimal allocation, equal allocation, and economic design for $150,000 survey
    13.14 Comparison of sampling errors from the three sample designs
    13.15 Desired stratum sample sizes and results of recruitment calls
    13.16 Distribution of departments and students
    13.17 Two-stage sample of students from the university
    13.18 Multistage sample using disproportionate sampling at the first stage
    13.19 Calculations for standard error from sample in Table 13.18
    13.20 Examples of cluster samples
    13.21 Cluster sample of doctor’s files
    13.22 Random drawing of blocks of dwelling units
    13.23 Calculations for paired selections and successive differences
    18.1 Potential complex codes for income categories
    18.2 Example codes for use of the internet and mobile phones
    19.1 Results of an hypothetical household survey
    19.2 Calculation of weights for the hypothetical household survey
    19.3 Two-way distribution of completed surveys
    19.4 Two-way distribution of terminated surveys
    19.5 Table 19.3 expressed as percentages
    19.6 Sum of the cells in Tables 19.3 and 19.4
    19.7 Cells of Table 19.6 as percentages
    19.8 Weights derived from Tables 19.7 and 19.5
    19.9 Results of an hypothetical household survey compared to secondary source data
    19.10 Two-way distribution of completed surveys by percentage (originally shown in Table 19.5)
    19.11 Results of factoring the rows of Table 19.10
  • 25. List of tables
    19.12 Second iteration, in which columns are factored
    19.13 Third iteration, in which rows are factored again
    19.14 Weights derived from the iterative proportional fitting
    20.1 Final disposition codes for RDD telephone surveys
    23.1 Preservation metadata elements and description
  • 26. Acknowledgements
As is always the case, many people have assisted in the process that has led to this book. First, I would like to acknowledge all those, too numerous to mention by name, who have helped me over the years to learn and understand some of the basics of designing and implementing surveys. They have been many and they have taught me much of what I now know in this field. However, having said that, I would particularly like to acknowledge those whom I have worked with over the past fifteen years or more on the International Steering Committee for Travel Survey Conferences (ISCTSC), who have contributed enormously to broadening and deepening my own understandings of surveys. In particular, I would like to mention, in no particular order, Arnim Meyburg, Martin Lee-Gosselin, Johanna Zmud, Gerd Sammer, Chester Wilmot, Werner Brög, Juan de Dios Ortúzar, Manfred Wermuth, Kay Axhausen, Patrick Bonnel, Elaine Murakami, Tony Richardson, (the late) Pat van der Reis, Peter Jones, Alan Pisarski, Mary Lynn Tischer, Harry Timmermans, Marina Lombard, Cheryl Stecher, Jean-Loup Madre, Jimmy Armoogum, and (the late) Ryuichi Kitamura. All these individuals have inspired and helped me and contributed in various ways to this book, most of them, probably, without realising that they have done so.
I would also like to acknowledge the support I have received in this endeavour from the University of Sydney, and especially from the director of the Institute of Transport and Logistics Studies, Professor David Hensher. Both David and the university have provided a wide variety of support for the writing and production of this book, for which I am most grateful.
However, most importantly, I would like to acknowledge the enormous support and encouragement from my wife, Carmen, and her patience, as I have often spent long hours working on this book, and her unquestioning faith in me that I could do it. She has been an enduring source of strength and inspiration to me. Without her, I doubt that this book would have been written.
As always, a book can see the light of day only through the encouragement and support of a publisher and those assisting in the publishing process. I would like to acknowledge Chris Harrison of Cambridge University Press, who first thought that this book might be worth publishing and encouraged me to develop the outline for
  • 27. it, and then provided critical input that has helped to shape the book into what it has become. I would also like to thank profusely Mike Richardson, who carefully and thoroughly copy-edited the manuscript, improving immensely its clarity and completeness. I would also like to thank Joanna Breeze, the production editor at Cambridge. She has worked with me with all the delays I have caused in the book production, and has still got this book to publication in a very timely manner. However, as always, and in spite of the help of these people, any errors that remain in the book are entirely my responsibility.
Finally, I would like to acknowledge the contributions made by the many students I have taught over the years in this area of survey design. The interactions we have had, the feedback I have received, and the enjoyment I have had in being able to teach this material and see students understand and appreciate what good survey design entails have been most rewarding and have also contributed to the development of this book. I hope that they and future students will find this book to be of help to them and a continuing reference to some of those points that we have discussed.
Peter Stopher
Blackheath, New South Wales
August 2011
  • 28. 1 Introduction
1.1 The purpose of this book
There are a number of books available that treat various aspects of survey design, sampling, survey implementation, and so forth (examples include Cochran, 1963; Dillman, 1978, 2000; Groves and Couper, 1998; Kish, 1965; Richardson, Ampt, and Meyburg, 1995; and Yates, 1965). However, there does not appear to be a single book that covers all aspects of a survey, from the inception of the survey itself through to archiving the data. This is the purpose of this book. The reader will find herein a complete treatment of all aspects of a survey, including all the elements of design, the requirements for testing and refinement, fielding the survey, coding and analysing the resulting data, documenting what happened, and archiving the data, so that nothing is lost from what is inevitably an expensive process.
This book concentrates on surveys of human populations, which are both more challenging generally and more difficult both to design and to implement than most surveys of non-human populations. In addition, because of the background of the author, examples are drawn mainly from surveys in the area of transport planning. However, the examples are purely illustrative; no background is needed in transport planning to understand the examples, and the principles explained are applicable to any survey that involves human response to a survey instrument. In spite of this focus on human participation in the survey process, there are occasional references to other types of surveys, especially observational and counting types of surveys.
In writing this book, the author has tried to make this as complete a treatment as possible. Although extensive references are included to numerous publications and books in various aspects of measuring data, the reader should be able to find all that he or she requires within the covers of this book. This includes a chapter on some basic aspects of statistics and probability that are used subsequently, particularly in the development of the statistical aspects of surveys.
In summary, then, the purpose of this book is to provide the reader with an extensive and, as far as possible, exhaustive treatment of issues involved in the design and execution of surveys of human populations. It is the intent that, whether the reader is a student, a professional who has been asked to design and implement a survey, or
  • 29. someone attempting to gain a level of knowledge about the survey process, all questions will be answered within these pages. This is undoubtedly a daunting task. The reader will be able to judge the extent to which this has been achieved. The book is also designed so that someone who has no prior knowledge of statistics, probability, surveys, or the purposes to which surveys may be put can pick up and read this book, gaining knowledge and expertise in doing so. At the same time, this book is designed as a reference book. To that end, an extensive index is provided, so that the user of this book who desires information on a particular topic can readily find that topic, either from the table of contents, or through the index.
1.2 Scope of the book
As noted in the previous section, the book starts with a treatment of some basic statistics and probability. The reader who is familiar with this material may find it appropriate to skip this chapter. However, for those who have already learnt material of this type but not used it for a while, as well as those who are unfamiliar with the material, it is recommended that this chapter be used as a means for review, refreshment, or even first-time learning. It is then followed by a chapter that outlines some basic issues of surveys, including a glossary of terms and definitions that will be found helpful in reading the remainder of the book. A number of fundamental issues, pertinent to overall survey design, are raised in this chapter. Chapter 4 introduces the topic of the ethics of surveys, and outlines a number of ethical issues and proposes a number of basic ethical standards to which surveys of human populations should adhere. The fifth chapter of the book discusses the primary issues of designing a survey. A major underlying theme of this chapter is that there is no such thing as an ‘all-purpose survey’. Experience has repeatedly demonstrated that only surveys designed with a clear purpose in mind can be successful.
The next nine chapters deal with all the various design issues in a survey, given that we have established the overall purpose or purposes of the survey. The first of these chapters (Chapter 6) discusses and describes all the current methods that are available for conducting surveys of human populations, in which people are asked to participate in the survey process. Mention is also made of some methods of dealing with other types of survey that are appropriate when the objects of the survey are observed in some way and do not participate in the process. In Chapter 7, the topic of focus groups is introduced, and potential uses of focus groups in designing quantitative and qualitative surveys are discussed. The chapter does not provide an exhaustive treatment of this topic, but does provide a significant amount of detail on how to organise and design focus groups. In Chapter 8, the design of survey instruments is discussed at some length. Illustrations of some principles of design are included, drawn principally from transport and related surveys. Chapters 9 and 10 deal with issues relating to question design and question wording and special issues relating to qualitative and preference surveys. Chapter 11 deals with the design of data collection procedures themselves, including such issues as item and unit nonresponse, what constitutes a
  • 30. complete response, the use of proxy reporting and its effects, and so forth. The seventh of this group of chapters (Chapter 12) deals with pilot surveys and pretests, a topic that is too often neglected in the design of surveys. A number of issues in designing and undertaking such surveys and tests are discussed. Chapter 13 deals with the topic of sample design and sampling issues. In this chapter, there is extensive treatment of the statistics of sampling, including estimation of sampling errors and determination of sample sizes. The chapter describes most of the available methods of sampling, including simple random samples, stratified samples, multistage samples, cluster samples, systematic samples, choice-based samples, and a number of sampling methods that are often considered but that should be avoided in most instances, such as quota samples, judgemental samples, and haphazard samples.
Chapter 14 addresses the topic of repetitive surveys. Many surveys are intended to be done as a ‘one-off’ activity. For such surveys, the material covered in the preceding chapters is adequate. However, there are many surveys that are intended to be repeated from time to time. This chapter deals with such issues as repeated cross-sectional surveys, panel surveys, overlapping samples, and continuous surveys. In particular, this chapter provides the reader with a means to compare the advantages and disadvantages of the different methods, and it also assists in determining which is appropriate to apply in a given situation.
Chapter 15 builds on the material in the preceding chapters and deals with the issue of survey economics. This is one of the most troublesome areas, because, as many companies have found out, it is all too easy to be bankrupted by a survey that is undertaken without a real understanding and accounting of the costs of a survey. While information on actual costs will date very rapidly, this chapter attempts to provide relative data on costs, which should help the reader estimate the costs of different survey strategies. This chapter also deals with many of the potential trade-offs in the design of surveys.
Chapter 16 delves into some of the issues relating to the actual survey implementation process. This includes issues relating to training survey interviewers and monitoring the performance of interviewers, and the chapter discusses some of the danger signs to look for during implementation. This chapter also deals with issues regarding the ethics of survey implementation, especially the relationships between the survey firm, the client for the survey, and the members of the public who are the respondents to the survey. Chapter 17 introduces a topic that is becoming of increasing interest: Web-based surveys. Although this is a field that is as yet quite young, there are an increasing number of aspects that have been researched and from which the reader can benefit.
Chapter 18 deals with the process of coding and data entry. A major issue in this topic is the geographic coding of places that may be requested in a survey. Chapter 19 addresses the topics of data expansion and weighting. Data expansion is outlined as a function of the sampling method, and statistical procedures for expanding each of the different types of sample are provided in this chapter. Weighting relates to problems of survey bias, resulting either from incomplete coverage of the population in the sampling process or from nonresponse by some members of the subject population.
  • 31. This is an increasingly problematic area for surveys of human populations, resulting from a myriad of issues relating to voluntary participation. Chapter 20 addresses the issue of nonresponse more completely. Here, issues of who is likely to respond and who is not are discussed. Methods to increase response rates are described, and reference is made again to the economics of the survey design. The question of computing response rates is also addressed in this chapter. This is usually the most widely recognised statistic for assessing the quality of a survey, but it is also a statistic that is open to numerous methods of computation, and there is considerable doubt as to just what it really means.
Chapter 21 deals with a range of other measures of data quality, some that are general and some, by way of example, that are specific to surveys in transport. These measures are provided as a way to illustrate how survey-specific measures of quality can be devised, depending on the purposes of the survey. Chapter 22 discusses some issues of the future of human population surveys, especially in the light of emerging technologies and their potential application and misapplication to the survey task.
Chapter 23, the final chapter in the book, covers the issues of documenting and archiving the data. This all too often neglected area of measuring data is discussed at some length. A list of headings for the final report on the survey is provided, along with suggestions as to what should be included under the headings. The issue of archiving data is also addressed at some length. Data are expensive to collect and are rarely archived appropriately. The result is that many expensive surveys are effectively lost soon after the initial analyses are undertaken. In addition, knowledge about the survey is often lost when those who were most centrally involved in the survey move on to other assignments, or leave to work elsewhere.
1.3 Survey statistics
Statistics in general, and survey statistics in particular, constitute a relatively young area of theory and practice. The earliest instance of the use of statistics is probably in the middle of the sixteenth century, and related to the start of data collection in France regarding births, marriages, and deaths, and in England to the collection of data on deaths in London each week (Berntson et al., 2005). It was then not until the middle of the eighteenth century that publications began to appear advancing some of the earliest theories in statistics and probability. However, much of the modern development of statistics did not take place until the late nineteenth and early twentieth centuries (Berntson et al., 2005):
    Beginning around 1880, three famous mathematicians, Karl Pearson, Francis Galton and Edgeworth, created a statistical revolution in Europe. Of the three mathematicians, it was Karl Pearson, along with his ambition and determination, that led people to consider him the founder of the twentieth-century science of statistics.
It was only in the early twentieth century that most of the now famous names in statistics made their contributions to the field. These included such statisticians as Karl
  • 32. Pearson, Francis Galton, C. R. Rao, R. A. Fisher, E. S. Pearson, and Jerzy Neyman, among many others, who all made major contributions to what we know today as the science of statistics and probability.
Survey sampling statistics is of even more recent vintage. Among the most notable names in this field of study are those of R. A. Fisher, Frank Yates, Leslie Kish, and W. G. Cochran. Fisher may have given survey sampling its birth, both through his own contributions and through his appointment of Frank Yates as assistant statistician at Rothamsted Experimental Station in 1931. In this post, Yates developed, often in collaboration with Fisher, what may be regarded as the beginnings of survey sampling in the form of experimental designs (O’Connor and Robertson, 1997). His book Sampling Methods for Censuses and Surveys was first published in 1949, and it appears to be the first book on statistical sampling designs. Leslie Kish, who founded the Survey Research Institute at the University of Michigan, is also regarded as one of the founding fathers of modern survey sampling methods, and he published his seminal work, called Survey Sampling, in 1965. Close in time to Kish, W. G. Cochran published his seminal work, Sampling Techniques, in 1963.
Based on these efforts, the science of survey sampling cannot be considered to be much over fifty years old, which makes it a very new scientific endeavour. As a result of this relative recency, there is still much to be done in developing the topic of survey sampling, while technologies for undertaking surveys have undergone and continue to undergo rapid evolution. The fact that most of the fundamental books on the topic are about forty years old suggests that it is time to undertake an updated treatise on the topic. Hence, this book has been undertaken.
  • 33. 2 Basic statistics and probability
2.1 Some definitions in statistics
Statistics is defined by the Oxford Dictionary of English Etymology as ‘the political science concerned with the facts of a state or community’, and the word is derived from the German statistisch. The beginning of modern statistics was in the sixteenth century, when large amounts of data began to be collected on the populations of countries in Europe, and the task was to make sense of these vast amounts of data. As statistics has evolved from this beginning, it has become a science concerned with handling large quantities of data, but also with using much smaller amounts of data in an effort to represent entire populations, when the task of handling data on the entire population is too large or expensive. The science of statistics is concerned with providing inputs to political decision making, to the testing of hypotheses (understanding what would happen if …), drawing inferences from limited data, and, considering the data limitations, doing all these things under conditions of uncertainty.
A word used commonly in statistics and surveys is population. The population is defined as the entire collection of elements of concern in a given situation. It is also sometimes referred to as a universe. Thus, if the elements of concern are pre-school children in a state, then the population is all the pre-school children in the state at the time of the study. If the elements of concern are elephants in Africa, then the population consists of all the elephants currently in Africa. If the elements of concern are the vehicles using a particular freeway on a specified day, then the population is all the vehicles that use that particular freeway on that specific day.
It is very clear that statistics is the study of data. Therefore, it is necessary to understand what is meant by data. The word data is a plural noun from the Latin datum, meaning given facts. As used in English, the word means given facts from which other facts may be inferred. Data are fundamental to the analysis and modelling of real-world phenomena, such as human populations, the behaviour of firms, weather systems, astronomical processes, sociological processes, genetics, etc. Therefore, one may state that statistics is the process for handling and analysing data, such that useful conclusions can be drawn, decisions made, and new knowledge accumulated.
  • 34. Another word used in connection with statistics is observation. An observation may be defined as the information that can be seen about a member of a subject population. An observation comprises data about relevant characteristics of the member of the population. This population may be people, households, galaxies, private firms, etc. Another way of thinking of this is that an observation represents an appropriate grouping of data, in which each observation consists of a set of data items describing one member of the population.

A parameter is a quantity that describes some property of the population. Parameters may be given as numbers, proportions, or percentages. For example, the number of male pre-school children in the state might be 16,897, and this number is a parameter. The proportion of baby elephants in Africa might be 0.39, indicating that 39 per cent of all elephants in Africa at this time are babies. This is also a parameter. Sometimes, one can define a particular parameter as being critical to a decision. This would then be called a decision parameter. For example, suppose that a decision is to be made as to whether or not to close a primary school. The decision parameter might be the number of schoolchildren that would be expected to attend that school in, say, the next five years.

A sample is some subset of a population. It may be a large proportion of the population, or a very small proportion of the population. For example, a survey of Sydney households, which comprise a population of about 1,300,000, might consist of 130,000 households (a 10 per cent sample) or 300 households (a 0.023 per cent sample).

A statistic is a numerical quantity that describes a sample. It is therefore the equivalent of a parameter, but for a sample rather than the population. For example, a survey of 130,000 households in Sydney might have shown that 52 per cent of households own their own home or are buying it.
This would be a statistic. If, on the other hand, a figure of 54 per cent was determined from a census of the 1,300,000 households, then this figure would be a parameter.

Statistical inference is the process of making statements about a population based on limited evidence from a sample study. Thus, if a sample of 130,000 households in Sydney was drawn, and it was determined that 52 per cent of these owned or were purchasing their homes, then statistical inference would lead one to propose that this might mean that 676,000 (52 per cent of 1,300,000) households in Sydney own or are purchasing their homes.

2.1.1 Censuses and surveys

Of particular relevance to this book is the fact that there are two methods for collecting data about a population of interest. The first of these is a census, which involves making observations of every member of the population. Censuses of the human population have been undertaken in most countries of the world for many years. There are references in the Bible to censuses taken among the early Hebrews, and later by the Romans at the time of the birth of Christ. In Europe, most censuses began in the eighteenth century, although a few began earlier than that. In the United States of America,
  • 35. censuses began in 1790. Many countries undertake a census once in each decade, either in the year ending in zero or in one. Some countries, such as Australia, undertake a census twice in each decade. A census may be as simple as a head count (enumerating the total size of the population) or it may be more complex, by collecting data on a number of characteristics of each member of the population, such as name, address, age, country of birth, etc.

A survey is similar to a census, except that it is conducted on a subset of the population, not the entire population. A survey may involve a large percentage of the population or may be restricted to a very small sample of the population. Much of the science of survey statistics has to do with how one makes a small sample represent the entire population. This is discussed in much more detail in the next chapter. A survey, by definition, always involves a sample of the population. Therefore, to speak of a 100 per cent sample is contradictory; if it is a sample, it must be less than 100 per cent of the population.

2.2 Describing data

One of the first challenges for statistics is to describe data. Obviously, one can provide a complete set of data to a decision maker. However, the human mind is not capable of utilising such information efficiently and effectively. For example, a census of the United States would produce observations on over 300 million people, while one of India would produce observations of over 1 billion people. A listing of those observations represents something that most human beings would be incapable of utilising. What is required, then, is to find some ways to simplify and describe data, so that useful information is preserved but the sheer magnitude of the underlying data is hidden, thereby not distracting the human analyst or decision maker.
Before examining ways in which data might be presented or described, such that the mind can grasp the essential information contained therein, it is important to understand the nature of different types of data that can be collected. To do this, it seems useful to consider the measurement of a human population, especially since that is the main topic of the balance of this book.

In mathematical statistics, we refer to things called variables. A variable is a characteristic of the population that may take on differing or varying values for different members of the population. Thus, variables that could be used to describe members of a human population may include such characteristics as name, address, age or date of birth, place of birth, height, weight, eye colour, hair colour, and shoe size. Each of these characteristics provides differing levels of information that can be used in various ways. We can divide these characteristics into four different types of scales, a scale representing a way of measuring the characteristic.

2.2.1 Types of scales

Nominal scales

Each person in the population has a name. The person’s name represents a label by which that person can be identified, but provides little other information. Names can
  • 36. be ordered alphabetically or can be ordered in any of a number of arbitrary ways, such as the order in which data are collected on individuals. However, no information is provided by changing the order of the names. Therefore, the only thing that the name provides is a label for each member of the population. This is called a nominal scale. A nominal scale is the least informative of the different types of scales that can be used to measure characteristics, but its lack of other information does not render it of less value. Other examples of nominal data are the colours of hair or eyes of the members of the population, bus route numbers, the numbers assigned to census collection districts, names of firms listed on a country’s stock exchange, and the names of magazines stocked by a newsagency.

Ordinal scales

Each person in the population has an address. The address will usually include a house number and a street name, along with the name of the town or suburb in which the house is located. The address clearly also represents a label, just as does the person’s name. However, in the case of the address, there is more information provided. If the addresses are sorted by number and by street, in most places in the world this will provide additional information. These sorted addresses will actually help an investigator to locate each home, in that it is expected that the houses are arranged in numerical order along the street, and probably with odd numbers on one side of the street and even numbers on the other side. As a result, there is order information provided in the address. It is, therefore, known as an ordinal scale. However, if it is known that one person lives at 27 Main Street, and another person lives at 35 Main Street, this does not indicate how far apart these two people live.
In some countries, they could be next door to each other, while in others there might be three houses between them or even seven houses between them (if numbering goes down one side of the street and back on the other). The only thing that would be known is that, starting at the first house on Main Street, one would arrive at 27 before one would arrive at 35. Therefore, order is the only additional information provided by this scale. Other examples of ordinal scales would be the list of months in the year, censor ratings of movies, and a list of runners in the order in which they finished a race.

Interval scales

Each person in the population has a shoe size. For the purposes of this illustration, the fact that there are slight inconsistencies in shoe sizes between manufacturers will be ignored, and it will be assumed, instead, that a man’s shoe size nine is the same for all men’s shoes, for example. Shoe size is certainly a label, in that a shoe can be called a size nine or a size twelve, and so forth. This may be a useful way of labelling shoes for a lot of different reasons. In addition, there is clearly order information, in that a size nine is smaller than a size twelve, and a size seven is larger than a size five. Furthermore, within each of children’s, men’s, and women’s shoes, each increase in a size represents a constant increase in the length of the shoe. Thus, the difference between a size nine and a size ten shoe for a man is the same as the difference between a size eight and a size nine, and so on for any two adjacent numbers. In other words,
  • 37. there is a constant interval between each shoe size. On the other hand, there is no natural zero in this scale (in fact, a size of zero generally does not exist), and it is not true that a size five is half the length of a size ten. Therefore, shoe size may be considered to be an interval scale. Women’s dress sizes in a number of countries also represent an interval scale, in which each increment in dress size represents a constant interval of increase in size of the dress, but a size sixteen dress is not twice as large as a size eight. In many cases, the sizing of an item of clothing as small, medium, large, etc. also represents an interval scale.

Another example of an interval scale is the normal scale of temperature in either degrees Celsius or degrees Fahrenheit. An interval of one degree represents the same increase or decrease in temperature, whether it is between 40 and 41 or 90 and 91. However, we are not able to state that 60 degrees is twice as hot as 30 degrees. There is also not a natural zero on either the Celsius scale or the Fahrenheit scale. Indeed, the Celsius scale sets the temperature at which water freezes as 0, but the Fahrenheit scale sets this at 32, and there is not a particular physical property of the zero on the Fahrenheit scale.

Ratio scale

Each member of the population has a height and a weight. Again, each of these two measures could be used as a label. We might say that a person is 180 centimetres tall, or weighs 85 kilograms. These measures also contain ordinal information. We know that a person who weighs 85 kilograms is heavier than a person who weighs 67 kilograms. Furthermore, we know that these measures contain interval information. The difference between 179 centimetres and 180 centimetres is the same as the difference between 164 centimetres and 165 centimetres. However, there is even more information in these measures. There is ratio information.
In other words, we know that a person who is 180 centimetres tall is twice as tall as a person who is 90 centimetres tall, and that a person weighing 45 kilograms is only half the weight of a person weighing 90 kilograms. There are two important new pieces of information provided by these measures. First, there is a natural zero in the measurement scale. Both weight and height have a zero point, which represents the absence of weight or the absence of height. Second, there is a multiplicative relationship among the measures on the scale, not just an additive one. Therefore, both weight and height are described as ratio scales. Other examples of ratio scales are distance or length measures, measures of speed, measures of elapsed time, and so forth. However, it should be noted that measurement of clock time is interval-scaled (there is no natural zero, and 5 a.m. is not a half of 10 a.m.), while elapsed time is ratio-scaled, because zero minutes represents the absence of any elapsed time, and twenty minutes is twice as long as ten minutes, for example.

Measurement scales

The preceding sections have outlined four scales of measurement: nominal, ordinal, interval, and ratio. They have also demonstrated that these four scales are themselves an ordinal scale, in which the order, as presented in the preceding sentence, indicates increasing information content. Furthermore, each of the scale types, as ordered above,
  • 38. contains the information of the previous type of scale, and then adds new information content. Thus an ordinal scale also has nominal information, but adds to that information on order; an interval scale has both nominal and ordinal information, but adds to that a consistent interval of measurement; and a ratio scale contains all nominal, ordinal, and interval information, but adds ratio relationships to them.

There are two other ways in which scales can be described, because most scales can be measured in different ways. The first of these relates to whether the scale is continuous or discrete. A continuous scale is one in which the measurement can be made to any degree of precision desired. For example, we can measure elapsed time to the nearest hour, or minute, or second, or nanosecond, etc. Indeed, the only thing that limits the precision by which we can measure this scale is the precision of our instruments for measurement. However, there is no natural limit to precision in such cases. This is a continuous scale. A discrete scale, on the other hand, cannot be subdivided beyond a certain point. For example, shoe sizes are a discrete scale. Many shoe manufacturers will provide shoes in half-size increments, while others will provide them only in whole-size increments. Subdivision below half sizes simply is not done. Similarly, any measurement that involves counting objects, such as counting the number of members of a population, is a discrete scale. We cannot have fractional people, fractional houses, or fractional cars, for example.

The second descriptor of a scale is whether it is inherently exact or approximate. By their nature, all continuous scales are approximate. This is so because we can always increase the precision of measurement. Generally, numbers obtained from counting are exact, unless the counting mechanism is capable of error. However, other discrete scales may be approximate or exact.
In most clothing or shoe sizes, the measure would be considered approximate, because sizes often differ between manufacturers, and between countries. A size nine shoe is not the same size in the United States and in the United Kingdom, for example, nor is it necessarily the same size from two different shoe makers in the same country.

It is important to recognise what type of a scale we are dealing with, when information is measured on scales, because the type of scale will also often either dictate how the information can be presented or restrict the analyst to certain ways of presentation. Similarly, whether the measure is discrete or continuous will also affect the presentation of the data, as will, in some cases, whether the data are approximate or exact.

2.2.2 Data presentation: graphics

It is appropriate to start with some simple rules about graphical presentations. There are four principal types of graphical presentation: scatter plots, pie charts, histograms or bar charts, and line graphs.

A scatter plot is a plot of the frequency with which specific values of a pair of variables occur in the data. Thus, the X-axis of the plot will contain the values of one of the variables that are found in the data, and the Y-axis will contain the values of the other
  • 39. variable. As such, any type of measure can be presented on a scatter plot. However, if all values occur only once (i.e., are unique to an observation), then a scatter plot is of no particular interest. Therefore, although any data can theoretically be plotted on a scatter plot, data that represent unique values, or data that are continuous and so will probably have frequencies of only one or two at most for any pair of values, will not be illuminated by a scatter plot.

An example of a scatter plot is provided in Figure 2.1, which shows a scatter plot of odometer readings of cars versus the model year of the vehicle. The Y-axis is a ratio-scaled variable, and the X-axis is an interval-scaled variable. The scatter plot indicates that there probably is a relationship between odometer readings and model year, such that the higher the model year value, the lower the odometer reading, as would be expected. This is a useful scatter plot.

[Figure 2.1: Scatter plot of odometer reading versus model year. X-axis: model year; Y-axis: odometer reading.]

[Figure 2.2: Scatter plot of fuel type by body type. X-axis: body type; Y-axis: fuel type.]

Figure 2.2 illustrates a scatter plot of two nominal-scaled variables: fuel type versus body type. It is not a very useful illustration of the data. First, we cannot tell how many points fall at each combination of values. Second, all it really tells us is that there are no taxis (body type 5) in this data set, that all vehicle types use petrol (fuel type 1), that all except motorcycles (body type 6) use diesel (fuel type 2), and that only cars (body
  • 40. type 1), four-wheel drive (4WD) vehicles (body type 2), and utility/van/panel vans (body type 3) use dual fuel (fuel type 4). This illustrates that nominal data (both fuel type and body type are nominal scales) may not produce a useful scatter plot.

A pie chart is a circle that is divided into segments representing specific values in the data, with the length of the segment along the circumference of the circle indicating how frequently the value occurs in the data. Again, pie charts can be used with any type of data, when the information to be presented is the frequency of occurrence. However, they will generally not work with continuous data, unless the data are first grouped and converted to discrete categories. An example of a pie chart is provided in Figure 2.3. This shows that the pie chart works well for nominal data, in this case the vehicle body type from a survey of households.

[Figure 2.3: Pie chart of vehicle body types. Segments: 4WD, Car, Motorcycle, Other, Taxi, Truck, Utility vehicle.]

Figure 2.4 shows a pie chart for category data (i.e., discrete data). The data are reported household incomes from a survey of households. The categories were those used in the survey. Income, being measured in dollars and with a natural zero, is actually a ratio scale. In the categories collected, income is a ratio-scaled discrete measure. Again, the pie chart provides a good representation of the data.

[Figure 2.4: Pie chart of household income groups. Segments: None, $1–$4,159, $4,160–$8,319, $8,320–$15,599, $15,600–$25,999, $26,000–$36,399, $36,400–$51,999, $52,000–$62,399, $62,400–$77,999, $78,000–$103,999, $104,000+, Don't know, Refused.]

A histogram or bar chart is used for presenting discrete data. Such data will be interval- or ratio-scaled data. Histograms can be constructed in several different ways. When presenting complex information, bars can be stacked, showing how different
  • 41. classes of items add up to a total within each bar. Bars can also be plotted so that each bar touches the next, or they may be plotted with gaps between. There is no particular rule for plotting bars in this manner, and it is more a matter of personal preference. Examples of two types of histograms are shown in Figures 2.5 and 2.6. Histograms can also be used to indicate the frequency of occurrence of specific values of both nominal and ordinal data. In this case, it is preferred that the bars do not touch, the spaces indicating that the scale is not interval or ratio.

[Figure 2.5: Histogram of household income. X-axis: annual income (classes from None to $104,000+); Y-axis: number of respondents.]

[Figure 2.6: Histogram of vehicle types. X-axis: vehicle type (4WD, Car, Motorcycle, Other, Taxi, Truck, Utility vehicle); Y-axis: number.]

Figure 2.5 shows ratio-scaled discrete data on household incomes, this time in a two-dimensional histogram or bar chart. Note that the bars touch, indicating the underlying continuous nature of the data. Figure 2.6 shows a histogram of nominal data frequencies of vehicle type for household vehicles. Two instructive observations may be made of this histogram. First, the dominance of the car tends to make the histogram
  • 42. somewhat less useful. In contrast, the pie chart really communicated the information better. Second, the bars do not touch, in this case clearly indicating the discrete categories of a nominal scale.

The fourth type of chart is a line graph. This is much more restricted in application than the other types of charts. A line graph should be used only with continuous data, whether interval- or ratio-scaled. It is inappropriate to use line graphs to present data that are discrete, or data that are nominal or ordinal in nature. An example of a line graph is shown in Figure 2.7. Temperature is inherently a continuous measurement. It is therefore appropriate to use a line graph to present these data. This case demonstrates the use of two lines on the same graph. This allows one not only to see the maximum and minimum temperatures, but also to deduce that there may be a relationship between the two.

A special type of line graph is an ogive. An ogive is a cumulative frequency line. Even when the original data are discrete in nature, the ogive can be plotted as a line, although a cumulative histogram is preferable. Generally, it makes sense to create cumulative graphs only of interval- or ratio-scaled data, although the data may be either discrete or continuous. Figure 2.8 shows an ogive for the income data used in Figure 2.5. The ogive is essentially an S-shaped curve, in that it starts with a line that is along the X-axis and ends with a line that is parallel to the X-axis, with the line climbing more or less continuously from the X-axis at the left to the top of the graph at the right. A special case of the ogive is a relative ogive, in which the proportions or percentage of observations are used, not the absolute counts, as in Figure 2.8.
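The points plotted on an ogive and on a relative ogive are simply running totals of the class frequencies. The following is a minimal Python sketch of that calculation, using the household income frequencies reported in Table 2.2 of this chapter (with the ‘Don't know’ and ‘Refused’ responses left out); the variable names are illustrative assumptions and are not taken from the book.

```python
from itertools import accumulate

# Income classes and frequencies as reported in Table 2.2,
# excluding the "Don't know" and "Refused" responses.
income_classes = [
    "None", "$1-$4,159", "$4,160-$8,319", "$8,320-$15,599",
    "$15,600-$25,999", "$26,000-$36,399", "$36,400-$51,999",
    "$52,000-$62,399", "$62,400-$77,999", "$78,000-$103,999", "$104,000+",
]
frequencies = [28, 2, 11, 67, 155, 97, 129, 72, 105, 133, 101]

# Ogive: cumulative count up to and including each class.
cumulative_counts = list(accumulate(frequencies))

# Relative ogive: the same values divided by the total number of observations.
total = cumulative_counts[-1]
cumulative_proportions = [count / total for count in cumulative_counts]

for label, count, prop in zip(income_classes, cumulative_counts, cumulative_proportions):
    print(f"{label:<18} {count:>5} {prop:>8.4f}")
```

Both lists rise monotonically and differ only by the constant divisor, which is why the two curves have the same shape on differently scaled Y-axes.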
[Figure 2.7: Line graph of maximum and minimum temperatures for thirty days. X-axis: day of week; Y-axis: temperature (°C); series: maximum temperature, minimum temperature.]

  • 43. A relative ogive for the same data will have the same shape, but the scale of the Y-axis changes, as shown in Figure 2.9. A step chart, which is the discrete version of an ogive, could also be drawn for the income data. It can use either the count, the proportion, or the percentage for the Y-axis. A step chart is shown in Figure 2.10.

[Figure 2.8: Ogive of cumulative household income data from Figure 2.5. X-axis: household income; Y-axis: cumulative number.]

[Figure 2.9: Relative ogive of household income. X-axis: household income; Y-axis: cumulative proportion.]

2.2.3 Data presentation: non-graphical

Graphical presentations of data are very useful. As can be seen in the preceding section, the adage that ‘a picture is worth a thousand words’ is clearly interpretable as ‘a picture is worth a thousand numbers’. Indeed, one can grasp rather readily from
  • 44. the graphs what is potentially a large amount of data, which the human mind would have difficulty grasping as raw data. However, pictures are not the only ways in which data can be presented for easier assimilation. There are also numeric ways to describe data. Ideally, what one would like would be some summary variables that would give one an idea about the magnitude of each variable in the data, the dispersion of values, the variability of the values, and the symmetry or lack of symmetry in the data.

Measures of magnitude

These measures could include such concepts as frequencies of occurrence of particular values in the data, proportions of the data that possess a particular value, cumulative frequencies or proportions, and some form of average value. Each of these measures is considered separately.

Frequencies and proportions

Frequencies are simply counts of the number of times that a particular value occurs in the data, while proportions are frequencies divided by the total number of observations in the data. Table 2.1 shows the frequencies of occurrence of the different vehicle types used in the earlier illustrations of graphical presentations.

For nominal data, cumulative frequencies or proportions are not sensible, because the scale does not contain any ordered information. Thus, to produce a cumulative frequency distribution for the entries in Table 2.1 would not make sense. Moreover, it should be noted that frequencies and proportions are generally sensible only for discrete data. However, if, in continuous data, there are large numbers of observations with the same value and the data set is large, then frequencies and proportions may possibly be useful. This, for example, might be the case for national data derived from a census.
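As a small illustration of the definitions above, the following Python sketch counts how often each value occurs in a list of observations and divides each count by the total. The observation list is rebuilt here from the vehicle-type counts reported in Table 2.1, purely so that the example has data to work on; the function name `frequencies_and_proportions` is a hypothetical helper, not something defined in the book.

```python
from collections import Counter

def frequencies_and_proportions(observations):
    """Return (frequencies, proportions) for a list of discrete values."""
    frequencies = Counter(observations)
    total = sum(frequencies.values())
    proportions = {value: count / total for value, count in frequencies.items()}
    return frequencies, proportions

# Rebuild one observation per vehicle from the counts reported in Table 2.1.
table_2_1_counts = {
    "Car": 1191, "4WD": 96, "Utility vehicle": 134,
    "Truck": 10, "Motorcycle": 19, "Other": 7,
}
observations = [vehicle for vehicle, n in table_2_1_counts.items() for _ in range(n)]

freqs, props = frequencies_and_proportions(observations)
for vehicle, count in freqs.most_common():
    print(f"{vehicle:<16} {count:>5} {props[vehicle]:>7.3f}")
```

Because vehicle type is a nominal scale, the sketch deliberately stops at frequencies and proportions and does not accumulate them.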
[Figure 2.10: Relative step chart of household income. X-axis: household income; Y-axis: cumulative proportion.]
  • 45. For the income data plotted in Figure 2.5, both frequencies or proportions, and cumulative frequencies or proportions, make sense. These are shown in Table 2.2.

From the information in Table 2.2 it is possible to grasp several things about the data on household income, such as the fact that the largest group is the one with $15,600–$25,999 annual income, followed by $78,000–$103,999 and $36,400–$51,999. It can also be seen that 16 per cent of households would not report their income. When the non-reported income is excluded, one can see that the proportions change substantially, and that a half of the population have incomes below $62,399. In effect, this table has summarised over 1,000 pieces of data and made them comprehensible, by presenting just a handful of numbers.

In the case of the income data shown in Table 2.2, the groups were defined in the survey itself. However, one may also take data that are collected as continuous measures and group them, both to display as a histogram and to present them in a table, similar to Table 2.2. In such a case, it is necessary to know into how many categories to group the data.

Number of classes or categories. Sturges’ rule (Sturges, 1926) provides guidance on how to determine the maximum number of classes into which to divide data, whether grouping already discrete data or continuous data. There are a number of elements to the rule.
(1) Interval classes must be inclusive and non-overlapping.
(2) Intervals should usually be of equal width, although the first and last interval may be open-ended for some types of data.
(3) The number of classes depends on the number of observations in the data, according to equation (2.1):

$k = 1 + 3.322 \log_{10} n$ (2.1)

where k = the number of classes, and n = the number of observations.

Suppose, for example, that the income data had been collected as actual annual income, and not in income classes.
Table 2.1 Frequencies and proportions of vehicle types
Vehicle type | Frequency | Proportion
Car | 1,191 | 0.817
4WD | 96 | 0.066
Utility vehicle | 134 | 0.092
Truck | 10 | 0.007
Taxi | 0 | 0
Motorcycle | 19 | 0.013
Other | 7 | 0.005
Total | 1,457 | 1.000

One might then ask the question as to how many
  • 46. classes would be the maximum that could be used for income. This would be obtained by substituting 900 into the above equation, because one should not include the missing data. This would result in a value for k of 10.81, which would be truncated to 10. Therefore, Sturges’ rule would indicate that the maximum number of intervals that should be used for these data is ten. The data were actually collected in eleven classes. Therefore, this would suggest that the design was marginally appropriate and there should not be a need to group together any of the classes with the number of valid observations obtained. However, the intervals used violate Sturges’ rule in one respect, in that they are not of equal size. This is not uncommon with income grouping, where it is often the case, as here, that the lower incomes are divided into smaller classes than the higher incomes. This is generally done to keep the population of the classes more nearly equal.

Suppose that the temperature data used in Figure 2.7 were to be grouped into classes. The raw data are shown in Table 2.3. There are thirty observations of daily maximum and minimum temperatures in this data set. Applying Sturges’ rule, the value of k is found to equal 5.92, suggesting that five intervals would be the most that could be used. For the high temperatures, the range is twenty-two to thirty-three. If this range is divided into groupings of two degrees, this would produce six intervals, while using three degrees would produce four intervals. In this case, given that k was found to be close to six, it would be best to use six intervals of two degrees per interval. For the low temperatures, the range in Table 2.3 is from sixteen to twenty-three.
Table 2.2 Frequencies, proportions, and cumulative values for household income
Income range | Frequency | Proportion | Cumulative frequency | Cumulative proportion (including missing) | Cumulative proportion (excluding missing)
None | 28 | 0.0262 | 28 | 0.0262 | 0.0311
$1–$4,159 | 2 | 0.0019 | 30 | 0.0280 | 0.0333
$4,160–$8,319 | 11 | 0.0103 | 41 | 0.0383 | 0.0456
$8,320–$15,599 | 67 | 0.0626 | 108 | 0.1009 | 0.1200
$15,600–$25,999 | 155 | 0.1449 | 263 | 0.2458 | 0.2922
$26,000–$36,399 | 97 | 0.0907 | 360 | 0.3364 | 0.4000
$36,400–$51,999 | 129 | 0.1206 | 489 | 0.4570 | 0.5433
$52,000–$62,399 | 72 | 0.0673 | 561 | 0.5243 | 0.6233
$62,400–$77,999 | 105 | 0.0981 | 666 | 0.6224 | 0.7400
$78,000–$103,999 | 133 | 0.1243 | 799 | 0.7467 | 0.8878
$104,000+ | 101 | 0.0944 | 900 | 0.8411 | 1.0000
Don’t know | 1 | 0.0009 | 901 | 0.8421 |
Refused | 169 | 0.1579 | 1,070 | 1.0000 |
Total | 1,070 | 1 | | |

Grouping these also into groups of two degrees in size, which is preferable when one wants to look at both minimum
  • 47. and maximum temperatures on the same graph, or in side-by-side graphs, would result in four groups. Because this is less than the maximum of six, it is acceptable. In this case, grouping is sensible only if what one wants to do is to create a histogram of the frequency with which various maximum and minimum temperatures occur. Such a frequency table is shown in Table 2.4.

Table 2.3 Minimum and maximum temperatures for a month (°C)
Day | Maximum temperature | Minimum temperature
Sunday | 23 | 18
Monday | 26 | 19
Tuesday | 25 | 19
Wednesday | 27 | 17
Thursday | 32 | 22
Friday | 29 | 21
Saturday | 26 | 20
Sunday | 27 | 19
Monday | 30 | 22
Tuesday | 31 | 21
Wednesday | 33 | 23
Thursday | 24 | 20
Friday | 25 | 18
Saturday | 27 | 19
Sunday | 28 | 20
Monday | 32 | 22
Tuesday | 24 | 18
Wednesday | 26 | 16
Thursday | 25 | 17
Friday | 22 | 17
Saturday | 28 | 19
Sunday | 27 | 20
Monday | 28 | 20
Tuesday | 29 | 21
Wednesday | 28 | 20
Thursday | 26 | 19
Friday | 27 | 20
Saturday | 30 | 21
Sunday | 29 | 20
Monday | 31 | 23

There is a second variant of Sturges’ rule for binary data. This variant defines the number of classes, as shown in equation (2.2):

$k = 1 + \log_{2} n$ (2.2)
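Equations (2.1) and (2.2) are straightforward to evaluate numerically. The following Python sketch does so for the two sample sizes used in this section (900 valid income observations and thirty temperature observations); the function names are illustrative assumptions rather than anything defined in the book, and the result is truncated to a whole number of classes as described in the text.

```python
import math

def sturges_classes(n):
    """Equation (2.1): k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def sturges_classes_log2(n):
    """Equation (2.2): k = 1 + log2(n)."""
    return 1 + math.log2(n)

for n in (900, 30):
    k1 = sturges_classes(n)
    k2 = sturges_classes_log2(n)
    # Truncate to obtain the maximum whole number of classes.
    print(f"n={n}: eq. (2.1) k={k1:.2f}, eq. (2.2) k={k2:.2f}, "
          f"maximum classes={math.floor(k1)}")
```

The two forms agree so closely because 3.322 is approximately 1/log10(2), so 3.322 × log10(n) is very nearly log2(n).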
When n is less than 1,000, the two equations result in approximately the same number of classes. For example, for 900 cases, this second formula gives k equal to 10.81, which is the identical result. For the thirty-observation case, the second formula gives 5.91, which is almost identical. It has been pointed out in various places (see Hyndman, 1995) that Sturges’ rule is good only for samples smaller than 200, and that it is based on a flawed argument. Nevertheless, it is still the standard used by most statistical software packages. There are two other rules that may be used, and these are discussed later in this chapter, because they utilise statistical measures that have not been discussed at this point. All the rules produce similar results for small samples, but diverge as the sample size becomes increasingly large. The other possible problems with Sturges’ rule are, first, that it may lead to over-smoothed data and, second, that its requirement for equal intervals may hide important information.

Stem and leaf displays
Another way to display discrete data is to use a stem and leaf display. Essentially, the stem is the most aggregate level of grouping of the data, while the leaf is made up of more disaggregate data. Table 2.5 shows some household data when the actual income was collected, rather than having people respond to pre-defined classes. A stem and leaf display would be constructed, for example, by using the tens of thousands of dollars as the stem and the thousands as the leaf. This, like a histogram, provides a picture of the distribution of the data, as shown in Figure 2.11. This graphic shows clearly the nature of the distribution of incomes.

Central measures of data
There are at least six different averages that can be computed, which provide different ways of assessing the central value of the data.
The six that are discussed here are: (1) arithmetic mean; (2) median; (3) mode; (4) geometric mean; (5) harmonic mean; and (6) quadratic mean.

Table 2.4 Grouped temperature data

Temperature  Number of  Number of  Cumulative number  Cumulative number
range        highs      lows       of highs           of lows
16–17                0          4                  0                  4
18–19                0          9                  0                 13
20–21                0         12                  0                 25
22–23                2          5                  2                 30
24–25                5          0                  7                 30
26–27                9          0                 16                 30
28–29                7          0                 23                 30
30–31                4          0                 27                 30
32–33                3          0                 30                 30

Stem  Leaf
 0    3 4 4 6 6 7 9 9
 1    2 3 3 4 6 6 8 9 9
 2    0 0 1 2 2 3 4 4 5 6 7
 3    1 3 4 6 7 7 8 9 9
 4    1 4 5 6 7 7
 5    0 4 5 5 7
 6    6 7 8 9
 7    0 1 2 6
 8    9
 9    1 6
10    1

Figure 2.11 Stem and leaf display of income

Table 2.5 Disaggregate household income data

Household  Annual    Household  Annual    Household  Annual
number     income    number     income    number     income
 1         $22,358   21          $9,226   41          $70,135
 2         $24,679   22         $96,435   42         $100,563
 3         $37,455   23         $55,341   43           $3,877
 4         $46,223   24         $89,367   44           $2,954
 5         $22,790   25         $12,984   45           $6,422
 6         $38,656   26         $21,444   46          $16,351
 7         $49,999   27         $36,339   47          $19,222
 8         $76,450   28         $20,105   48          $56,778
 9         $53,744   29         $44,446   49          $41,237
10         $18,919   30         $34,288   50          $24,892
11         $44,881   31         $25,678   51          $31,084
12         $26,570   32          $4,122   52          $68,008
13         $12,135   33          $7,390   53          $71,039
14         $46,990   34         $65,809   54          $13,133
15         $37,855   35         $47,001   55          $18,259
16         $32,568   36         $23,874   56          $14,249
17          $8,917   37         $39,007   57          $36,898
18         $19,772   38         $67,445   58          $91,045
19         $72,455   39         $54,890   59           $6,341
20         $69,078   40         $22,378   60          $15,887

The arithmetic mean
The arithmetic mean is simply the total of all the values in the data divided by the number of elements in the sample that provided valid values for the statistic. Mathematically, it is usually written as equation (2.3):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (2.3)$$

In words, the mean of the variable x is equal to the sum of all the values of x in the data set, divided by the number of observations, n. It is important to note that values of x that contribute to the estimation of the mean are only those that are valid, and that n is also a count of the valid observations. Thus, in the income data we used previously, the missing values would be removed, and a mean, if it was calculated, would be based on 900 observations, not on the 1,070 survey returns.

The sample mean (i.e., the value of the mean estimated from a sample of observations) is normally denoted by the symbol x̅, while the true mean from the population is denoted by the Greek letter μ. It is a convention in statistics to use Greek letters to denote true population values, and the equivalent Roman letter to denote the sample estimate of that value. Put another way, the parameter is denoted by a Greek letter, and the statistic by the equivalent Roman letter.

Using the temperature data from Table 2.3, the sum of the maximum temperatures is found to be 825, which yields an arithmetic mean of 27.5°C. Similarly, the sum of the minimum temperatures is 591, which gives an arithmetic mean of 19.7°C. In each of these cases there were thirty valid observations, so the total or sum was divided by thirty to give the arithmetic mean. Similarly, using the income data from Table 2.5, the sum of the incomes is $2,248,437. With sixty valid observations of income, the arithmetic mean of income is $37,474.
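The temperature means quoted above can be reproduced with a few lines of Python. This is a minimal sketch, not a prescribed method: the data are transcribed from Table 2.3, and the helper name `arithmetic_mean` and the use of `None` to mark a missing value are illustrative assumptions.

```python
# Daily maximum and minimum temperatures (°C), transcribed from Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]


def arithmetic_mean(values):
    """Sum of the valid values divided by the count of valid values.

    Missing observations (marked here as None, an assumption for this
    sketch) are dropped from both the sum and the count n.
    """
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


print(sum(max_temps), arithmetic_mean(max_temps))
print(sum(min_temps), round(arithmetic_mean(min_temps), 1))
```

Dropping missing values before counting mirrors the point made for the income data, where the mean would rest on 900 valid observations rather than on all 1,070 returns.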
The arithmetic mean (usually referred to simply as the mean, because it is the mean most often used) can also be understood by considering it as being the centre of gravity of the data. This is shown in Figure 2.12. In each of the two distributions shown in the figure, the fulcrum or balance point represents the mean. In the distribution on the left the mean is at thirteen, while in the one on the right it is at fourteen. Figure 2.12 illustrates two important facts. First, the symmetry or lack of it in a distribution of values will affect where the mean falls. Second, the arithmetic mean is influenced by extreme values. If the value of twenty were removed from the data distribution on the right of Figure 2.12, the mean would shift to thirteen. On the other hand, if the extreme value had been at twenty-five instead of twenty, the mean value would shift to 14.5. These changes come about by changing one out of nine observations, suggesting some substantial sensitivity of the mean to a relatively small change in the data.

The median
The median is the central value of the data, or it can be defined to be the value for which half the data are above the value and half are below. For any data, the median value is most easily found by ordering the data in increasing or decreasing value and then finding the midpoint value. For the temperature data, this is seen fairly easily in the grouped data of Table 2.4. For the maximum temperature, the dividing point between the first fifteen values and the last fifteen values is found at 27°C, which is therefore the median value. Similarly, for the minimum temperatures, the median is 20°C. Note that the median must be a whole number of degrees in these cases, because the data are reported only in whole numbers of degrees. Note that the medians of each of these two variables are not exactly the same as the means, although they are very close.

For already grouped data, the median must be a range. Looking back at the income data in Table 2.2, and using the cumulative proportions with the missing data excluded, it can be seen that the median falls in the interval $36,400–$51,999, because the cumulative proportion rises from 0.4000 to 0.5433 across that interval. For the income data in Table 2.5, the median can be an actual value. However, because there is an even number of observations, the median actually falls between the thirtieth and the thirty-first observations, so between $32,568 and $34,288. By interpolation, the median would be $33,428. Comparing this to the mean, it is noted that the mean is quite a bit higher at $37,474.

The mode
The mode is the most frequently occurring value in a set of observations. For the maximum temperature data, the mode occurs at 27°C, for which there are five observations.
For the minimum temperature, the mode occurs at 20°C, for which there are eight days on which this temperature occurs. For the income data in Table 2.2, the mode is $15,600–$25,999, the class with the highest frequency (155 households). This is quite different from the median. For the income data in Table 2.5, there is no mode for the ungrouped data, because each value is unique. To find a mode, it is necessary to group the data. This has, effectively, been done in the stem and leaf display, from which it can be determined that the mode is in the range of $20,000–$24,999, which contains eight households. Using classes of $5,000 for the ranges, there is no other range that has as many households in it. If ranges of $10,000 were used, then the mode would be in the range $20,000–$29,999.

Figure 2.12 Arithmetic mean as centre of gravity
Source: Ewart, Ford, and Lin (1982: 38).

Unlike all the other mean values, there may be more than one mode. In fact, the limit on the number of modes that can occur is the number of observations, if each value occurs only once in the data set. However, this is not a useful result, and data in which each value occurs only once, as in the income data, should be grouped to provide more useful information. Data may be distributed bimodally or trimodally, or more. This means that there will be multiple peaks in the data distribution. Figure 2.13 shows a possible bimodal distribution of daily maximum temperatures. There are two modes in the underlying data, one at 23°C and one at 27°C. Knowing that there are two modes in a data set provides information on the appearance of the underlying distribution, as shown in Figure 2.13.

The geometric mean
The geometric mean is similar to the arithmetic mean, except that it is determined from the product of all the values, not the sum, and the nth root of the product is taken, rather than dividing by n. Thus, the geometric mean is written as shown in equation (2.4):

$$\bar{x}_g = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \qquad (2.4)$$

It is most useful when looking at growth over time periods. For example, suppose an individual had investments in a mutual fund over a period of twelve years, and the fund experienced the growth rates shown in Table 2.6.
The question one might like to ask is: ‘What is the average annual growth rate over the twelve years?’ If one were to estimate this using the arithmetic mean, one would obtain the answer that the average growth rate is 5.85 per cent. However, the geometric mean produces a value of 5.77 per cent. Although this difference does not appear to be numerically large, it has a significant effect on calculations of the value of the investment at the end of twelve years. If one assumes that the actual initial investment was $10,000, then the actual fund would stand at $19,609.82. This is exactly the result that would be obtained by using the geometric mean. However, the arithmetic mean would estimate the fund as being $19,782.92, a difference of $173.10.

Figure 2.13 Bimodal distribution of temperatures (x-axis: maximum daily temperatures; y-axis: days of occurrence)

The arithmetic mean is obtained from equation (2.5):

$$\bar{x} = \frac{1.052 + 1.067 + 1.103 + 1.139 + 1.116 + 1.065 + 1.059 + 1.038 + 1.002 + 1.021 + 1.016 + 1.024}{12} = \frac{12.702}{12} = 1.0585 \qquad (2.5)$$

This produces an estimated annual average growth rate of 5.85 per cent. Using this to estimate the actual value of the fund at the end of twelve years, assuming an initial investment of $10,000, one would calculate equation (2.6):

$$V_{12} = \$10{,}000 \times (1.0585)^{12} = \$10{,}000 \times 1.978292 = \$19{,}782.92 \qquad (2.6)$$

The geometric mean is obtained from equation (2.7):

$$\bar{x}_g = (1.052 \times 1.067 \times 1.103 \times 1.139 \times 1.116 \times 1.065 \times 1.059 \times 1.038 \times 1.002 \times 1.021 \times 1.016 \times 1.024)^{1/12} = (1.96098)^{1/12} = 1.0577 \qquad (2.7)$$

This produces the estimated annual geometric mean growth rate of 5.77 per cent.
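The comparison between the two means can be sketched in Python as follows. This is an illustrative check only, using the growth rates from Table 2.6 and the $10,000 initial investment assumed in the text; the variable names are hypothetical.

```python
import math

# Annual growth rates in per cent, 1993-2004, from Table 2.6.
growth_pct = [5.20, 6.70, 10.30, 13.90, 11.60, 6.50,
              5.90, 3.80, 0.20, 2.10, 1.60, 2.40]

# Convert each rate to a compounding factor (1 + growth rate).
factors = [1 + g / 100 for g in growth_pct]
n = len(factors)

arithmetic = sum(factors) / n              # equation (2.5)
geometric = math.prod(factors) ** (1 / n)  # equation (2.7)

initial = 10_000
actual_value = initial * math.prod(factors)    # year-by-year compounding
arithmetic_value = initial * arithmetic ** n   # overstates the fund

print(f"arithmetic mean factor: {arithmetic:.4f}")
print(f"geometric mean factor:  {geometric:.4f}")
print(f"actual fund value:      {actual_value:,.2f}")
print(f"arithmetic estimate:    {arithmetic_value:,.2f}")
```

Raising the unrounded geometric mean to the twelfth power recovers the product of the factors, which is why it reproduces the compounded fund value while the arithmetic mean does not.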
To estimate the value of the fund at the end of twelve years, one estimates in the same manner as for the arithmetic mean, as in equation (2.8):

$$V_{12} = \$10{,}000 \times (1.0577)^{12} = \$10{,}000 \times 1.960982 = \$19{,}609.82 \qquad (2.8)$$

The reader can readily verify that this is identical to the amount calculated by applying each year’s growth rate, compounded, to the amount of the fund at the end of the previous year. Note that both the arithmetic and geometric means are obtained by using the compounding formula (1 + growth rate) to obtain the average rate of growth.

Table 2.6 Growth rates of an investment fund, 1993–2004

Year  Growth (percentage change)
1993                        5.20
1994                        6.70
1995                       10.30
1996                       13.90
1997                       11.60
1998                        6.50
1999                        5.90
2000                        3.80
2001                        0.20
2002                        2.10
2003                        1.60
2004                        2.40

The harmonic mean
The harmonic mean is obtained by summing the inverse of the values for each observation, taking the inverse of this value, and multiplying the result by the number of observations. It may be written as shown in equation (2.9):

$$\bar{x}_h = \frac{n}{\sum_{i=1}^{n} 1/x_i} \qquad (2.9)$$

The harmonic mean is used to estimate a mean from rates such as rates by time or distance. A good example would be provided by estimating the average speed of a train when the train’s speed changes every one kilometre, because of track condition, signals, and congestion. Suppose that the speeds for each kilometre of a twenty kilometre train trip were as shown in Table 2.7. If one were to take the arithmetic mean, this would give a mean speed of 58.25 kilometres per hour (km/h). This would suggest that the time taken for the trip was 20 × 60 / 58.25 minutes, or 20.6 minutes, when it was actually 22.7 minutes (see Table 2.7).

Table 2.7 Speeds by kilometre for a train

Kilometre of trip  Speed (km/h)  Time taken (minutes)
 1                           40                 1.5
 2                           45                 1.333
 3                           55                 1.091
 4                           60                 1
 5                           70                 0.857
 6                           65                 0.923
 7                           50                 1.2
 8                           35                 1.714
 9                           40                 1.5
10                           60                 1
11                           70                 0.857
12                           80                 0.75
13                          100                 0.6
14                           90                 0.667
15                           70                 0.857
16                           60                 1
17                           60                 1
18                           45                 1.333
19                           40                 1.5
20                           30                 2
Total                         –                22.683

The harmonic mean is calculated as shown in equation (2.10):

$$\bar{x}_h = \frac{20}{\frac{1}{40} + \frac{1}{45} + \frac{1}{55} + \frac{1}{60} + \frac{1}{70} + \frac{1}{65} + \frac{1}{50} + \frac{1}{35} + \frac{1}{40} + \frac{1}{60} + \frac{1}{70} + \frac{1}{80} + \frac{1}{100} + \frac{1}{90} + \frac{1}{70} + \frac{1}{60} + \frac{1}{60} + \frac{1}{45} + \frac{1}{40} + \frac{1}{30}} = \frac{20}{0.37805} = 52.903 \qquad (2.10)$$

This gives a harmonic mean speed of 52.903 km/h. Using this figure, rather than the arithmetic mean speed, the time taken for the twenty kilometre trip is 20 × 60 / 52.903 minutes, or 22.7 minutes, which is the correct figure.

The quadratic mean
The quadratic mean is also known as the root mean square (RMS). It is given by summing the squared values of the observations, dividing these by the number of observations, and taking the square root of the result, as shown in equation (2.11):

$$RMS = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}} \qquad (2.11)$$

The quadratic mean is most often used with data whose arithmetic mean is zero. It is often used for estimating error when the expected value of the average error is zero. For example, suppose that one is assessing the accuracy of a machine that produces ball bearings of nominally 100 millimetres (mm) in diameter. Measurements are taken of a number of ball bearings, and the actual diameters found to be those shown in Table 2.8, which also shows how much each one deviates from 100 mm. The arithmetic mean of the deviations is −0.11 mm. However, the RMS is ±0.81 mm. The latter value gives a much clearer idea of the amount by which the ball bearings actually deviate from the desired diameter, because it does not allow the negative values to compensate for the positive ones. It shows, more precisely, the tolerance in the manufacturing process.

Relationships between mean (arithmetic), median, and mode
There are relationships between the arithmetic mean (referred to hereafter as the mean), the median, and the mode that can tell us more about the underlying data.
In the temperature data from Table 2.3, it was found that the mean high temperature was 27.5°C, the median was 27°C, and the mode occurred at 27°C. In this case, it can be seen that the mode, median, and mean are all quite close. For the low temperatures, the mean was 19.7°C, the median was 20°C, and the mode was also 20°C. Again, the values are very similar. In contrast, for the income data of Table 2.5, the mean is $37,474, the median is $33,428, and the mode would be in the range $20,000–$24,999. These values are not particularly close.
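The three measures for the temperature data can be compared side by side with Python's standard `statistics` module. This is a minimal sketch using the values transcribed from Table 2.3; the variable names are illustrative, and `multimode` is used so that every mode would be reported if there were more than one.

```python
import statistics

# Daily maximum and minimum temperatures (°C), transcribed from Table 2.3.
max_temps = [23, 26, 25, 27, 32, 29, 26, 27, 30, 31, 33, 24, 25, 27, 28,
             32, 24, 26, 25, 22, 28, 27, 28, 29, 28, 26, 27, 30, 29, 31]
min_temps = [18, 19, 19, 17, 22, 21, 20, 19, 22, 21, 23, 20, 18, 19, 20,
             22, 18, 16, 17, 17, 19, 20, 20, 21, 20, 19, 20, 21, 20, 23]

for label, temps in (("maximum", max_temps), ("minimum", min_temps)):
    mean = statistics.mean(temps)
    median = statistics.median(temps)
    modes = statistics.multimode(temps)  # list of all most-frequent values
    print(f"{label}: mean={mean:.1f}, median={median}, mode(s)={modes}")
```

When the three values nearly coincide, as they do here, the distribution is close to symmetrical; a mean well above the median, as in the income data, points to a longer right-hand tail.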
For the mean, mode, and median to be the same value, the data must be distributed symmetrically around the mean and median, and the distribution must be unimodal (i.e., have one mode), which must occur at the mean value. Plotting the temperature data, as shown in Figure 2.14 for the high temperatures and Figure 2.15 for the low temperatures, shows distributions that are very nearly symmetrical and that meet the conditions for a coincidence of mean, mode, and median. Using Sturges’ rule, with sixty observations on income, incomes should be grouped into seven equal steps. This can be done by setting the intervals to $15,000. The result is shown in Figure 2.16. In contrast to the temperature data, Figure 2.16 shows that the data are not symmetrical but, rather, that they are skewed to the right, meaning that there is a longer tail to the distribution to the right than to the left. This leads to a median and a mode that are both below the mean.

Figure 2.14 Distribution of maximum temperatures from Table 2.4 (x-axis: maximum temperature; y-axis: frequency)

Table 2.8 Measurements of ball bearings

Ball bearing  Diameter (mm)  Deviations
 1                     98.5        −1.5
 2                    100.2         0.2
 3                     99.6        −0.4
 4                     98.9        −1.1
 5                    100.6         0.6
 6                    100.3         0.3
 7                    100.7         0.7
 8                     99.1        −0.9
 9                     99.9        −0.1
10                    101.1         1.1
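The harmonic and quadratic means described earlier can be checked against Tables 2.7 and 2.8 with a short Python sketch. The function names are illustrative assumptions, and the data are transcribed from the two tables.

```python
import math

# Train speeds (km/h) for each kilometre of the trip, from Table 2.7.
speeds_kmh = [40, 45, 55, 60, 70, 65, 50, 35, 40, 60,
              70, 80, 100, 90, 70, 60, 60, 45, 40, 30]

# Deviations (mm) of each ball bearing from 100 mm, from Table 2.8.
deviations_mm = [-1.5, 0.2, -0.4, -1.1, 0.6, 0.3, 0.7, -0.9, -0.1, 1.1]


def harmonic_mean(values):
    """Number of observations divided by the sum of reciprocals (eq. 2.9)."""
    return len(values) / sum(1 / v for v in values)


def quadratic_mean(values):
    """Square root of the mean of the squared values (eq. 2.11)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


h = harmonic_mean(speeds_kmh)
print(f"harmonic mean speed: {h:.3f} km/h")
print(f"trip time for 20 km: {20 * 60 / h:.1f} minutes")
print(f"mean deviation:      {sum(deviations_mm) / len(deviations_mm):.2f} mm")
print(f"RMS deviation:       {quadratic_mean(deviations_mm):.2f} mm")
```

The trip time derived from the harmonic mean matches the total in Table 2.7, while the RMS shows the typical size of the deviations that the signed mean largely cancels out.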