
- 1. DBA6000: Quantitative Business Research Methods. Rob J Hyndman.
- 2. © Rob J Hyndman, 2008.
  Professor Rob Hyndman
  Department of Econometrics and Business Statistics
  Monash University (Clayton campus), VIC 3800
  Email: Rob.Hyndman@buseco.monash.edu.au
  Telephone: (03) 9905 2358
  www.robhyndman.info
- 3. Contents
  Preface
  1 Research design
    1.1 Statistics in research
    1.2 Organizing a quantitative research study
    1.3 Some quantitative research designs
    1.4 Data structure
    1.5 The survey process
    Appendix A: Case studies
  2 Data collection
    2.1 Introduction
    2.2 Data collecting instruments
    2.3 Errors in statistical data
    2.4 Questionnaire design
    2.5 Data processing
    2.6 Sampling schemes
    2.7 Scale development
    Appendix B: Case studies
  3 Data summary
    3.1 Summarising categorical data
    3.2 Summarising numerical data
    3.3 Summarising two numerical variables
    3.4 Measures of reliability
    3.5 Normal distribution
  4 Computing and quantitative research
    4.1 Data preparation
    4.2 Using a statistics package
    4.3 Further reading
    4.4 SPSS exercise
  5 Significance
    5.1 Proportions
- 4.  5.2 Numerical differences
  6 Statistical models and regression
    6.1 One numerical explanatory variable
    6.2 One categorical explanatory variable
    6.3 Several explanatory variables
    6.4 Comparing regression models
    6.5 Choosing regression variables
    6.6 Multicollinearity
    6.7 SPSS exercises
  7 Significance in regression
    7.1 Statistical model
    7.2 ANOVA tables and F-tests
    7.3 t-tests and confidence intervals for coefficients
    7.4 Post-hoc tests
    7.5 SPSS exercises
  8 Dimension reduction
    8.1 Factor analysis
    8.2 Further reading
  9 Data analysis with a categorical response variable
    9.1 Chi-squared test
    9.2 Logistic and multinomial regression
    9.3 SPSS exercises
  10 A survey of statistical methodology
  11 Further methods
    11.1 Classification and regression trees
    11.2 Structural equation modelling
    11.3 Time series models
    11.4 Rank-based methods
  12 Presenting quantitative research
    12.1 Numerical tables
    12.2 Graphics
    Appendix: Good graphs for better business
  13 Readings
- 5. Preface
  Subject convenor
  Professor Rob J Hyndman, B.Sc.(Hons), Ph.D., A.Stat.
  Department of Econometrics and Business Statistics
  Location: Room 671, Menzies Building, Clayton.
  Phone: (03) 9905 2358
  Email: Rob.Hyndman@buseco.monash.edu.au
  WWW: http://www.robhyndman.info

  Objectives
  On completion of this subject, students should have:
  • the necessary quantitative skills to conduct high-quality independent research related to business administration;
  • a comprehensive grounding in a number of quantitative methods of data production and analysis;
  • been introduced to quantitative data analysis through a practical research activity.

  Synopsis
  This unit considers the quantitative research methods used in studying business, management and organizational analysis. Topics to be covered:
  1. research design, including experimental designs, observational studies, case studies, longitudinal analysis and cross-sectional analysis;
  2. data collection, including designing data collection instruments, sampling strategies and assessing the appropriateness of archival data for a research purpose;
  3. data analysis, including graphical and numerical techniques for the exploration of large data sets and a survey of advanced statistical methods for modelling the relationships between variables;
- 6. 4. communication of quantitative research; and
  5. the use of statistical software packages such as SPSS in research.
  The effective use of several quantitative research methods will be illustrated through reading research papers drawn from several disciplines.

  References
  None of these are required texts; they provide useful background material if you want to read further. Huck (2007) is excellent on interpreting statistical results in academic papers. Pallant (2007) is very helpful when using SPSS and gives advice on how to write up research results. Use Wild and Seber (2000) if you need to brush up on your basic statistics; it contains lots of helpful advice and interesting examples.
  1. Huck, S.W. (2007) Reading statistics and research, 5th ed. Allyn & Bacon: Boston, MA.
  2. Pallant, J. (2007) SPSS survival manual, 3rd ed. Allen & Unwin.
  3. De Vaus, D. (2002) Analyzing social science data. SAGE Publications: London.
  4. Wild, C.J., & Seber, G.A.F. (2000) Chance encounters: a first course in data analysis and inference. John Wiley & Sons: New York.

  Timetable
  17 July: Introduction / Chapter 1
  24 July: Chapter 2
  31 July: Chapter 3
  7 August: Chapter 4 (SPSS tutorial)
  14 August: Chapter 5
  21 August: Chapter 6
  28 August: Chapter 7 (SPSS tutorial)
  4 September: Chapters 8–9 (SPSS tutorial)
  11 September: Chapter 10
  18 September: Chapters 11–12 (first assignment due)
  25 September: No class
  2 October: No class
  9 October: SPSS tutorial
  16 October: Oral presentations (second assignment due)
- 7. Assessment
  1. A written report presenting and critiquing a research paper which uses quantitative research methods. (45%)
     • It can be a published research paper from a scholarly journal, or a company report. It must contain substantial quantitative research. It must be approved in advance.
     • Your report should include comments on the research questions addressed, the appropriateness of the data used, how the data were collected, the method of analysis chosen, and the conclusions drawn.
     • Length: 4000–5000 words, excluding tables and graphs.
     • Due: 17 September.
  2. A written report presenting some original quantitative analysis of a suitable multivariate data set. (45%)
     • You may use your own data, or use data that I will provide. The data set must include at least four variables. It can be data from your workplace.
     • Your report should include comments on the research questions addressed, the appropriateness of the data used, how the data were collected, the method of analysis chosen, and the conclusions drawn.
     • You may use any statistical computing package or Excel for analysis.
     • Length: 4000–5000 words, excluding tables and graphs.
     • Due: 15 October.
  3. A 20-minute oral presentation of one of the above reports. (10%)
     • On either 8 or 15 October.

  Assignment marking scheme
  • Research questions addressed: 6%
  • Appropriateness of data: 6%
  • Data collection: 6%
  • Description of statistical methods used: 6%
  • Suitability of statistical methods: 6%
  • Discussion of statistical results: 8%
  • Conclusions (are they supported/valid?): 7%

  Choosing a paper for Assignment 1
  Choose something you are interested in. For example, it can be an article you are reading as part of your other DBA studies or something you have read as part of your professional life. The following journals contain some articles that would be suitable. There are also many others.
  • Australian Journal of Management
  • International Journal of Human Resource Management
  • Journal of Advertising
  • Journal of Applied Management Studies
  • Journal of Management
  • Journal of Management Accounting Research
- 8. • Journal of Management Development
  • Journal of Managerial Issues
  • Journal of Marketing
  • Management Decision
  You can obtain online copies of some of these via the Monash Voyager Catalogue. Hard copies should be in the Monash library.
  Things to look for:
  • it should involve some substantial data analysis;
  • it should involve more than summary statistics (e.g., a regression model, or some chi-squared tests);
  • it should not use sophisticated statistical methods that are beyond this subject (e.g., avoid factor analysis and structural equation models).
  All papers should be approved by Rob Hyndman before you begin work on the assignment.

  Choosing a data set for Assignment 2
  • Choose something you know about. The best data analyses involve a mix of good knowledge of the data context and good use of statistical methodology.
  • Don't try to do too much. One response variable with 3–5 explanatory variables is usually sufficient. Resist the temptation to write a long treatise!
  • You will find it easier if the response variable is numeric. Analysing categorical response variables with several explanatory variables can be tricky.
  • Be clear about the purpose of your analysis. State some explicit objectives or hypotheses, and address them via your statistical analysis.
  • Think about what you include. A few well-chosen graphics that tell a story are better than pages of computer output that mean very little.
  • Start early. Even before we cover much methodology, you can do some basic data summaries and think about the key questions you want to address.
  • All data sets should be approved by Rob Hyndman before you begin work on the assignment.

  Readings
  Most weeks we will read a case study from a research journal and discuss the analysis. Please read these in advance. We will discuss them in the third hour. You cannot use a paper we have discussed for your first assessment task. If you have a suggestion of a paper that may be suitable for class discussion, please let me know.
- 9. CHAPTER 1: Research design

  1.1 Statistics in research
  "Statistics is the study of making sense of data." (Ott and Mendenhall)
  "The key principle of statistics is that the analysis of observations doesn't depend only on the observations but also on how they were obtained." (Anonymous)
  • Data beat anecdotes: "'For example' proves nothing." (Hebrew proverb)
  • Data beat intuition: "Belief is no substitute for arithmetic." (Henry Spencer)
  • Data beat "expert" opinion: "When information becomes unavailable, the expert comes into his own." (A.J. Liebling)

  1.1.1 Statistics answers questions using data
  • Do pollutants cause asthma?
  • Do transaction volumes on the stock market react to price changes?
  • Does deregulation reduce unemployment?
  • Does fluoride reduce tooth decay?

  A definition
  Statistical Analysis: Mysterious, sometimes bizarre, manipulations performed upon the collected data of an experiment in order to obscure the fact that the results have no generalizable meaning for humanity. Commonly, computers are used, lending an additional aura of unreality to the proceedings. (Source unknown)
  97.3% of all statistics are made up.
- 10. Part 1. Research design
  1.1.2 Some statistics stories
  [Figure: The Challenger disaster. Number of O-rings damaged versus ambient temperature at launch (55–80°F).]
  [Figure: Charlie's chooks. Y: percentage mortality versus X: percentage Tegel birds.]
- 11. Risk factors for heart disease
  A doctor wants to investigate who is most at risk of coronary-related death. He selects 12 patients at random from his clinic and records their age, blood pressure and drug used. He also records whether they eventually died from heart disease or not.

    Age  BP  Drug  L/D
     18  68   1     D
     20  64   2     L
     22  72   1     D
     25  67   2     L
     29  80   –     D
     33  70   –     D
     34  86   1     D
     36  85   –     D
     37  73   2     L
     39  82   –     L
     41  90   1     D
     45  87   2     L

    Drug   Lived  Died  % lived
     1       0     4       0%
     2       4     0     100%
     –       1     3      25%
    Total    5     7

  Drug 1 looks bad, drug 2 looks good.
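The cross-tabulation on this slide is easy to reproduce in code. The sketch below (plain Python; the data are exactly the 12 rows of the table, with "-" standing in for the dash meaning no drug recorded) tallies outcomes by drug and prints the percentage who lived:

```python
from collections import defaultdict

# The 12 patients from the table above: (age, blood pressure, drug, outcome),
# where "L"/"D" means lived/died and "-" means no drug recorded.
patients = [
    (18, 68, "1", "D"), (20, 64, "2", "L"), (22, 72, "1", "D"),
    (25, 67, "2", "L"), (29, 80, "-", "D"), (33, 70, "-", "D"),
    (34, 86, "1", "D"), (36, 85, "-", "D"), (37, 73, "2", "L"),
    (39, 82, "-", "L"), (41, 90, "1", "D"), (45, 87, "2", "L"),
]

# Cross-tabulate outcome by drug, mirroring the summary table on the slide.
counts = defaultdict(lambda: {"L": 0, "D": 0})
for age, bp, drug, outcome in patients:
    counts[drug][outcome] += 1

for drug in ("1", "2", "-"):
    lived, died = counts[drug]["L"], counts[drug]["D"]
    print(f"Drug {drug}: lived {lived}, died {died}, "
          f"{100 * lived / (lived + died):.0f}% lived")
```

This reproduces the 0%, 100% and 25% figures in the summary table. Of course, the point of the example is that the comparison is not trustworthy: the patients were not randomized to drugs, and with only 12 cases other variables may explain the difference.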
- 12. 1.1.3 Causation and association
  Smoking and lung cancer
  There is a strong positive correlation between smoking and lung cancer. There are several possible explanations.
  • Causal hypothesis: smoking causes lung cancer.
  • Genetic hypothesis: there is a hereditary trait which predisposes people to both nicotine addiction and lung cancer.
  • Sloppy-lifestyle hypothesis: smoking is most prevalent amongst people who also drink too much, don't exercise, eat unhealthy food, etc.

  Postnatal care
  Mothers who return home from hospital soon after birth do better than those who stay in hospital longer.
  • Causation hypothesis: hospital is harmful and/or home is helpful.
  • Common response hypothesis: mothers return home early because they are coping well.
  • Confounding hypothesis: mothers return home early if there is someone at home to help.

  University applicants
             Male  Female  Total
    Accept     70     40    110
    Reject    100    100    200
    Total     170    140    310
  Is there evidence of discrimination?

  Course: Introduction to bean counting
             Male  Female  Total
    Accept     60     20     80
    Reject     60     20     80
    Total     120     40    160
- 13. Course: Advanced welding
             Male  Female  Total
    Accept     10     20     30
    Reject     40     80    120
    Total      50    100    150

  This is an example of Simpson's Paradox, which occurs when the association between variables is reversed when data from several groups are combined.

  Other examples of Simpson's Paradox
  • The average tax rate has increased with time even though the rate in every income category has decreased. Why?
  • The average female salary of B.Sc. graduates is lower than the average male salary. Why?

  Causality or association?
  1. A positive correlation between blood pressure and income is observed. Does this indicate a causal connection?
  2. In a survey in 1960, it was found that for 25–34-year-old males there was a positive correlation between years of school completed and height. Does going to school longer make a man taller?
  3. The same survey showed a negative correlation between age and educational level for persons aged over 25. Why?
  4. Students at fee-paying private schools perform better on average in the VCE than students at government-funded schools. Why?

  Some subtle differences
  • Distinguish between: causation and association; prediction and causation; prediction and explanation.
  • Note the difference between deterministic and probabilistic causation.
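The reversal in the university-applicants example can be verified directly from the two course tables. The short Python sketch below computes acceptance rates per course and then for the pooled data; within each course the rates are identical for men and women, yet pooling makes the male rate look higher, because proportionally more women applied to the more selective course:

```python
# Applicant numbers from the two course tables above:
# sex -> (accepted, total applicants).
courses = {
    "bean counting": {"male": (60, 120), "female": (20, 40)},
    "advanced welding": {"male": (10, 50), "female": (20, 100)},
}

def rate(accepted, total):
    """Acceptance rate as a percentage."""
    return 100 * accepted / total

# Within each course, acceptance rates are identical across the sexes.
for course, by_sex in courses.items():
    for sex, (acc, tot) in by_sex.items():
        print(f"{course}, {sex}: {rate(acc, tot):.1f}%")

# Pooling the two courses reverses the picture.
combined = {}
for sex in ("male", "female"):
    acc = sum(by_sex[sex][0] for by_sex in courses.values())
    tot = sum(by_sex[sex][1] for by_sex in courses.values())
    combined[sex] = rate(acc, tot)
    print(f"combined, {sex}: {combined[sex]:.1f}%")
```

The pooled rates are 41.2% for men and 28.6% for women (70/170 versus 40/140), matching the first table, even though neither course favours either sex.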
- 14. 1.2 Organizing a quantitative research study
  As a quick check, ask the following questions:
  1. What is your hypothesis (your research question)?
  2. What is already known about the problem (literature review)?
  3. What sort of design is best suited to studying your hypothesis? (method)
  4. What data will you collect to test your hypothesis? (sample)
  5. How will you analyse these data? (data analysis)
  6. What will you do with the results of the study? (communication)
  These questions are broken down in more detail below. (They are mostly taken from Rubin et al. (1990), and have also appeared in Balnaves and Caputi (2001).)

  1.2.1 Hypothesis
  • What is the goal of the research?
  • What is the problem, issue, or critical focus to be researched?
  • What are the important terms? What do they mean?
  • What is the significance of the problem?
  • Do you want to test a theory?
  • Do you want to extend a theory?
  • Do you want to test competing theories?
  • Do you want to test a method?
  • Do you want to replicate a previous study?
  • Do you want to correct previous research that was conducted in an inadequate manner?
  • Do you want to resolve inconsistent results from earlier studies?
  • Do you want to solve a practical problem?
  • Do you want to add to the body of knowledge in another manner?

  1.2.2 Review of literature
  • What does previous research reveal about the problem?
  • What is the theoretical framework for the investigation?
  • Are there complementary or competing theoretical frameworks?
  • What are the hypotheses and research questions that have emerged from the literature review?
- 15. 1.2.3 Method
  • What methods or techniques will be used to collect the data? (This holds for applied and non-applied research.)
  • What procedures will be used to apply the methods or techniques?
  • What are the limitations of these methods?
  • What factors will affect the study's internal and external validity?
  • Will any ethical principles be jeopardized?

  1.2.4 Sample
  • Who (what) will provide (constitute) the data for the research?
  • What is the population being studied?
  • Who will be the participants for the research?
  • What sampling technique will be used?
  • What materials and information are necessary to conduct the research?
  • How will they be obtained?
  • What special problems can be anticipated in acquiring needed materials and information?
  • What are the limitations in the availability and reporting of materials and information?

  1.2.5 Data analysis
  • How will data be analysed?
  • What statistics will be used?
  • What criteria will be used to determine whether hypotheses are supported?
  • What was discovered (about the goal, data, method, and data analysis) as a result of doing preliminary work (if conducted)?

  1.2.6 Communication
  • How will the final research report be organised? (Outline)
  • What sources have you examined thus far that pertain to your study? (Reference list)
  • What additional information does the reader need?
  • What time frame (deadlines) have you established for collecting, analysing and presenting data? (Timetable)

  1.3 Some quantitative research designs
  • Case study: questionnaire, interview, observation. Best for exploratory work and hypothesis generation. Limited quantitative analysis possible.
  • Survey: questionnaire, interview, observation. Best if the sample is random.
  • Experiment: questionnaire, interview, observation. Best for demonstrating causality.
- 16. 1.3.1 Cross-sectional vs longitudinal analysis
  All designs can be either cross-sectional or longitudinal.
  • A cross-sectional design involves data collection at one time only.
  • A longitudinal design involves successive data collection over a period of time. It is necessary if you want to study changes over time.

  1.3.2 Case study designs
  • involve intense involvement with a few cases rather than limited involvement with many cases;
  • results can't easily be generalized;
  • useful for exploring ideas and generating hypotheses.

  1.3.3 Survey designs
  • most popular in business/management research;
  • useful when you cannot control the things you want to study;
  • difficult to get random and representative samples.

  1.3.4 Experimental designs
  • require a control group to allow for the placebo effect;
  • require the experimenter to control all variables other than the variable of interest;
  • require randomization to groups;
  • allow causation to be tested.

  Which research design would you use?
  Hypotheses:
  1. Women believe they are better at managing than men.
  2. Children who listen to poetry in early childhood make better progress in learning to read than those who do not.
  3. A business will run more efficiently if no person is directly responsible for more than five other people.
  4. There are inherent advantages in businesses staying small.
  5. Employees with postgraduate qualifications have shorter job expectancy than employees without postgraduate qualifications.
  What data would you collect in each case?
- 17. 1.4 Data structure
  1.4.1 Populations and samples
  A population is the entire collection of 'things' in which we are interested. A sample is a subset of a population. We wish to make an inference about a population of interest based on information obtained from a sample from that population.
  Examples:
  • You measure the profit/loss of 50 randomly selected public hospitals in Victoria.
    Population:
    Sample:
    Points of interest:
  • Sales of 500 products from one company for the last 5 years are analysed.
    Population:
    Sample:
    Points of interest:

  1.4.2 Cases and variables
  Think about your data in terms of cases and variables.
  • A case is the unit about which you are taking measurements, e.g., a person, a business.
  • A variable is a measurement taken on each case, e.g., age, score on a test, grade level, income.

  1.4.3 Types of data
  The ways of organizing, displaying and analysing data depend on the type of data we are investigating.
  • Categorical data (also called nominal or qualitative), e.g., sex, race, type of business, postcode. Averages don't make sense. Ordered categories are called ordinal data.
  • Numerical data (also called scale, interval and ratio), e.g., income, test score, age, weight, temperature, time. Averages make sense.
  Note that we sometimes treat numerical data as categories (e.g., three age groups).
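The case/variable view of data, and the split between categorical and numerical variables, maps directly onto how you would summarise data in code. In the sketch below (plain Python with invented illustrative data), each case is a business; the categorical variable is summarised with counts, while the numerical one can sensibly be averaged:

```python
from collections import Counter
from statistics import mean

# Four hypothetical cases (the data are invented for illustration):
# each case is a business, with categorical and numerical variables.
cases = [
    {"state": "VIC", "type": "retail",  "profit": 52000, "employees": 12},
    {"state": "NSW", "type": "finance", "profit": 61000, "employees": 40},
    {"state": "VIC", "type": "retail",  "profit": 48000, "employees": 8},
    {"state": "QLD", "type": "retail",  "profit": 55000, "employees": 20},
]

# Categorical variable: summarise with counts; an "average type" is meaningless.
print(Counter(c["type"] for c in cases))   # retail: 3, finance: 1

# Numerical variable: an average makes sense.
print(mean(c["profit"] for c in cases))    # 54000
```

Statistics packages such as SPSS make the same distinction: declaring a variable nominal, ordinal or scale determines which summaries and analyses the package will offer for it.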
- 18. 1.4.4 Response and explanatory variables
  Response variable: measures the outcome of a study. Also called the dependent variable.
  Explanatory variable: attempts to explain the variation in the observed outcomes. Also called an independent variable.
  Many statistical problems can be thought of in terms of a response variable and one or more explanatory variables.
  • Study of profit/loss in Victorian hospitals.
    Response variable:
    Explanatory variables:
  • Monthly sales of 500 products.
    Response variable:
    Explanatory variables: competitor advertising.

  1.5 The survey process
  1. Plan the survey. State the objectives. In order to state the objectives we often need to ask questions such as:
     • What is the survey's exact purpose?
     • What do we not know and want to know?
     • What inferences do we need to draw?
     Begin by developing a specific list of information needs. Then write focused survey questions.
  2. Design the sampling procedure. Identify the target population: whom are we drawing conclusions about? Select a sampling scheme: examples include simple random sampling, stratified random sampling, systematic sampling, and cluster sampling.
  3. Select a survey method. Decide how to collect the data: personal interviews, telephone interviews, mailed questionnaires, diaries, . . .
  4. Develop the questionnaire. Write the questionnaire. Decide on the wording, types of questions, and other issues.
  5. Pretest the questionnaire. Select a very small sample from the sampling frame. Conduct the survey and see what goes wrong. Correct any problems before carrying out the full-scale study.
  6. Conduct the survey. Run the survey in an efficient and time-effective manner.
  7. Analyse the data. Gather the results and determine outcomes.
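Two of the sampling schemes mentioned in step 2 can be sketched in a few lines of code. The example below (plain Python; the sampling frame, stratum names and sample size of 100 are all invented for illustration) draws a simple random sample from a frame, and then a proportionally allocated stratified random sample:

```python
import random
from collections import defaultdict

random.seed(1)  # fixed seed so the illustration is reproducible

# A hypothetical sampling frame of 1000 employees, each tagged with an
# industry stratum (the stratum names are invented for this example).
frame = [{"id": i, "industry": random.choice(["retail", "finance", "health"])}
         for i in range(1000)]

# Simple random sampling: every unit has the same chance of selection.
srs = random.sample(frame, 100)

# Stratified random sampling: draw a simple random sample within each
# stratum, with sample sizes proportional to stratum sizes.
strata = defaultdict(list)
for unit in frame:
    strata[unit["industry"]].append(unit)

stratified = []
for industry, units in strata.items():
    n = round(100 * len(units) / len(frame))  # proportional allocation
    stratified.extend(random.sample(units, n))
```

Stratification guarantees that each stratum is represented in roughly its population proportion, which a simple random sample only achieves on average.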
- 19. Appendix A: Case studies
  Injury management in NSW
  Four injury management pilots (IMPs) ran during 2001:
  • private hospitals and nursing homes within NSW;
  • all industry groups within the Central West NSW region;
  • two insurance companies (QBE and EML).
  We wish to do a statistical comparison of the injury management pilots with the current standard injury management arrangements.

  Performance measures
  • incidence of specific payment types
  • duration of claims
  • number of claims
  • proportion of claimants in receipt of weekly benefits at 4, 8, 13 and 26 weeks
  • costs for claimants at 4, 8, 13 and 26 weeks:
    – medical, rehabilitation, physiotherapy, chiropractic
    – weekly benefits
  • timeliness:
    – number of days from injury to agent notification
    – number of days from injury to first payment

  Some potential driving variables
  • age
  • gender
  • injury type
  • agency (e.g., powered tools)
  • severity of injury
  • medical interventions
  • employer size
  • insuring agency
  • weekly pay at time of injury
  • industry (ANZSIC code)
  • occupation (ASCO code)

  Driving variables affect the performance measures. Variations between groups in key driving variables can induce apparent differences between groups, which are then confused with any real differences due to the programs being evaluated. Therefore any comparison of groups of employees should either eliminate the effect of the drivers or try to measure their effect.
- 20. The ideal design
  Ideally, we would use a randomized controlled trial. This eliminates the effect of driving variables.
  • The control group would be employees on the old IM system.
  • The treatment group would be employees in the new IMP.
  • Employees would be randomly allocated to the two groups.
  • Statistical comparisons between the two groups would show differences between the old IM system and the new IMP.
  • The random allocation would prevent any systematic differences between those in the IMP and those not in the IMP.
  • Such a scheme is impracticable here.

  The actual design
  We have to use pseudo-control groups and eliminate differences between the control and IMP groups using statistical models.
  • All injuries within the specified industry group, geographical region or insurer will be subject to the new IMP during 2001.
  • The pseudo-controls will be the equivalent groups of employees in 2000, who were not subject to the new IMP.

  The problem of confounding
  • If there are differences between the IMP group and the control, are they due to the different IM program or to the different group?
  Solutions:
  • adjust for as many driving variables as possible;
  • compare similar groups not subject to the IMP.

  Comparisons undertaken
  IMP group: private hospitals/nursing homes in NSW, 2001. Pseudo-control: private hospitals/nursing homes, 2000.
  IMP group: Central West NSW region, 2001. Pseudo-control: Central West NSW region, 2000.
  IMP group: insurance company, 2001. Pseudo-control: insurance company, 2000.
  Non-IMP group: comparable industry group, 2001. Pseudo-control: comparable industry group, 2000.
  Non-IMP group: comparable NSW region, 2001. Pseudo-control: comparable NSW region, 2000.
- 21. We do not directly compare:
  • private hospitals/nursing homes with other industry groups;
  • the Central West NSW region with other geographical regions.
  Instead, we compare the change between 2000 and 2001 in each industry group and each geographical region.

  How to interpret the results
  • If all 2001 groups differ from the 2000 groups after taking into account all drivers, then there are probably changes between years not reflected in the drivers, and we won't be able to attribute any changes to the IMP.
  • If all IMP 2001 groups differ from the 2000 groups after taking into account all drivers, but the non-IMP 2001 groups do not differ from the 2000 groups, then the changes between years are likely to be due to the IMP.
- 22. Needlestick injuries
  You are interested in the number and severity of needlestick injuries amongst health workers involved in blood donation and transfusion. Work in groups of three to carefully define the objectives of your survey. You will need to specify:
  • the objective of the survey;
  • what data are to be collected;
  • the target population;
  • the survey population;
  • the sample;
  • the data collection method;
  • potential errors which could occur in your survey.

  Palliative care referrals
  A few years ago, I helped the Health Department with a survey on palliative care. As part of the study, it was necessary to study the 'referral' pattern of palliative care providers: how many patients they send to hospital (for inpatient or outpatient treatment); how many they refer to consultants for specialist comment; how many to community health programs; and so on.
  Possible sampling schemes:
  1. sample a group of palliative care practitioners and study their referral patterns;
  2. sample a group of palliative care patients and study their referral patterns.
  Discuss the possible advantages and disadvantages of the two schemes.
CHAPTER 2
Data collection

2.1 Introduction

"You don't have to eat the whole ox to know that the meat is tough."
Samuel Johnson

Sampling is very familiar to all of us, because we often reach conclusions about phenomena on the basis of a sample of those phenomena. You may test a swimming pool's temperature by dipping your toe in the water, or the performance of a new vehicle by a short test drive. These are among the countless small samples we rely on when making personal decisions. We tend to use haphazard methods in picking our samples, and so risk substantial sampling error.

Research also usually reaches its conclusions on the basis of sampling, but the methods used must adhere to certain rules, which are discussed below. The goal in obtaining data through survey sampling is to use a sample to make precise inferences about the target population, and we want to be highly confident about those inferences. A solid grasp of sampling theory is needed to appraise the reliability and validity of the conclusions drawn from any sample.

2.2 Data collecting instruments

The choice of data collection instrument is crucial to the success of a survey. When determining an appropriate data collection method, many factors need to be taken into account, including the complexity or sensitivity of the topic, the response rate required, the time and money available for the survey, and the population to be targeted. Some of the most common data collection methods are described in the following sections.
Part 2. Data collection

2.2.1 Interviewer-enumerated surveys

Interviewer-enumerated surveys involve a trained interviewer going to the potential respondent, asking the questions and recording the responses.

The advantages of this methodology are:
• better data quality
• special questioning techniques can be used
• greater rapport established with the respondent
• more complex issues can be included
• higher response rates
• more flexibility in explaining things to respondents
• greater success in dealing with language problems

The disadvantages of this methodology are:
• expensive to conduct
• training for interviewers is required
• more intrusive for the respondent
• interviewer bias may become a source of error

2.2.2 Web surveys

Web surveys are increasingly popular, although care must be taken to avoid sample selection bias and multiple responses from one individual.

The advantages of this methodology are:
• cheap to administer
• private and confidential
• easy to use conditional questions, and to prompt on a missing or inappropriate response
• live checking can be built in
• multiple language versions can be provided

The disadvantages of this methodology are:
• respondent bias may become a source of error
• not everyone has access to the internet
• language and interface must be very simple
• cannot build up a rapport with respondents
• resolution of queries is difficult
• only appropriate when straightforward data are to be collected

2.2.3 Mail surveys

In self-enumeration mail surveys, the questionnaire is left with the respondent to complete.

The advantages of this methodology are:
• cheaper to administer
• more private and confidential
• in some cases does not require interviewers

The disadvantages of this methodology are:
• difficult to follow up non-response
• respondent bias may become a source of error
• response rates are much lower
• language must be very simple
• problems with poor English and literacy skills
• cannot build up a rapport with respondents
• resolution of queries is difficult
• only appropriate when straightforward data are to be collected

2.2.4 Telephone surveys

In a telephone survey, a potential respondent is phoned and asked the survey questions over the phone.

The advantages of this methodology are:
• cheap to administer
• convenient for interviewers and respondents

The disadvantages of this methodology are:
• interviews are easily terminated by the respondent
• cannot use prompt cards to provide alternative answers
• burden placed on interviewers and respondents
• sample is biased towards households with phones

2.2.5 Diaries

Diaries can be used as a format for a survey. Respondents are directed to record the required information over a predetermined period in the diary, book or booklet supplied.

The advantages of this methodology are:
• high quality and detailed data from completed diaries
• more private and confidential circumstances for the respondent
• does not require interviewers

The disadvantages of this methodology are:
• response rates are lower and the diaries are rarely completed well
• language must be simple
• can only include relatively simple concepts
• cannot build up a rapport
• cannot explain the purpose of survey items to respondents
                                                   Face-to-face  Telephone     Mail
  Response rates                                   Good          Good          Good
  Representative samples
    Avoidance of refusal bias                      Good          Good          Poor
    Control over who completes the questionnaire   Good          Good          Satisfactory
    Gaining access to the selected person          Satisfactory  Good          Good
    Locating the selected person                   Satisfactory  Good          Good
  Effects on questionnaire design
  Ability to handle:
    Long questionnaires                            Good          Satisfactory  Satisfactory
    Complex questions                              Good          Poor          Satisfactory
    Boring questions                               Good          Satisfactory  Poor
    Item non-response                              Good          Good          Satisfactory
    Filter questions                               Good          Good          Satisfactory
    Question sequence control                      Good          Good          Poor
    Open-ended questions                           Good          Good          Poor
  Quality of answers
    Minimize socially desirable responses          Poor          Satisfactory  Good
    Ability to avoid distortion due to:
      Interviewer characteristics                  Poor          Satisfactory  Good
      Interviewer opinions                         Satisfactory  Satisfactory  Good
      Influence of other people                    Satisfactory  Good          Poor
    Allows opportunities to consult                Satisfactory  Poor          Good
    Avoids subversion                              Poor          Satisfactory  Good
  Implementing the survey
    Ease of finding suitable staff                 Poor          Good          Good
    Speed                                          Poor          Good          Satisfactory
    Cost                                           Poor          Satisfactory  Good

Table 2.1: Advantages and disadvantages of three methods of data collection. Table taken from de Vaus (2001), who adapted it from Dillman (1978).

2.2.6 Ideas for increasing response rates

1. Provide a reward.
2. Follow up systematically.
3. Keep it short.
4. Choose an interesting topic.
2.2.7 Archival data

Rather than collecting your own data, you may use existing data. If you do, keep the following points in mind.

Available information: Is there sufficient documentation of the original research proposal for which the data were collected? If not, there may be hidden problems in re-using the data.

Geographical area: Are the data relevant to the geographical area you are studying? For example, what country, city, state or other area do the archival data cover?

Time period: Are the data relevant to the time period you are studying? Does your research cover recent events, is it historical, or does it look at changes over a specified range of time? Most data are at least a year old before they are released to the public.

Population: What population do you wish to study? This can refer to a group or groups of people, particular events, official records, etc. You should also consider whether you will look at a specific sample or subset of people, events, records, etc.

Context: Do the archival data contain the information relevant to your research area?

2.3 Errors in statistical data

In sample surveys there are two types of error that can occur:
• sampling error, which arises because only a part of the population is used to represent the whole population; and
• non-sampling error, which can occur at any stage of a sample survey.

It is important to be aware of these errors so that they can be minimized.

2.3.1 Sampling error

Sampling error is the error we make in selecting samples that are not representative of the population. Since it is practically impossible for a smaller segment of a population to be exactly representative of the whole, some degree of sampling error will be present whenever we select a sample.
It is important to consider sampling error when publishing survey results, as it gives an indication of the accuracy of the estimates and therefore of the weight that can be placed on interpretations.

If sampling principles are carefully applied within the constraints of available resources, sampling error can be accurately measured and kept to a minimum. Sampling error is affected by:
• sample size
• variability within the population
• sampling scheme
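Sampling error shrinks with the square root of the sample size: under simple random sampling, the standard error of a sample mean is σ/√n. A minimal numerical sketch (the values of σ and n are invented for illustration):

```python
import math

def standard_error(sigma, n):
    """Standard error of a sample mean: population sd divided by sqrt(n)."""
    return sigma / math.sqrt(n)

# Invented figures: population standard deviation 15, samples of 100 and 400.
se_small = standard_error(15.0, 100)   # 1.5
se_large = standard_error(15.0, 400)   # 0.75
print(se_small / se_large)             # 2.0 -- quadrupling n halves the error
```

The square-root relationship is why precision gets expensive: each halving of the sampling error costs four times the sample size.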
Generally, larger sample sizes decrease sampling error: to halve the sampling error, the sample size has to be increased fourfold. In fact, sampling error can be completely eliminated by increasing the sample size to include every element in the population.

The population variability also affects the error: more variable populations give rise to larger errors, as estimates calculated from different samples are more likely to show greater variation. The effect of variability within the population can be reduced by increasing the sample size to make the sample more representative of the target population.

2.3.2 Non-sampling error

Non-sampling error can be defined as any survey error that is not sampling error: any error not caused by the fact that we have selected only part of the population. Even if we were to undertake a complete enumeration of the population, non-sampling errors might remain. In fact, as the size of the sample increases, non-sampling errors may get larger, because of such factors as a possible increase in the non-response rate, interviewer errors, and data processing errors.

For the most part we cannot measure the effect that non-sampling errors have on the results, and because of their nature, these errors may never be totally eliminated. Perhaps the biggest source of non-sampling error is a poorly designed questionnaire. The questionnaire can influence the response rate achieved in the survey, the quality of responses obtained and consequently the conclusions drawn from survey results.

Some common sources of non-sampling error are discussed in the following paragraphs.

Target population: Failure to identify clearly who is to be surveyed. This can result in an inadequate sampling frame, imprecise definitions of concepts and poor coverage rules.

Non-response: A non-response error occurs when the respondents do not reflect the sampling frame.
This could occur when the people who do not respond to the survey differ from the people who do respond, as often happens in voluntary response polls. For example, suppose that in an air-bag study we asked respondents to call a 0018 number to be interviewed. Because a 0018 call costs $2 per minute, many drivers may not respond. Furthermore, those who do respond may be the people who have had bad experiences with air bags. Thus the final sample of respondents may not even represent the sampling frame. Similarly:

• telephone polls miss people without phones;
• household surveys miss the homeless, prisoners, students in colleges, etc.;
• train surveys only target public transport users, and tend to over-represent regular public transport users.
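The arithmetic of this kind of self-selection is easy to sketch. Suppose (all figures invented for illustration) that 30% of drivers had a bad air-bag experience, but the aggrieved are five times as likely to phone in:

```python
# Back-of-envelope sketch of voluntary-response bias; all figures are invented.
population = 100_000
p_bad_experience = 0.30      # true proportion with a complaint
reply_if_bad = 0.10          # the aggrieved are much keener to call
reply_if_ok = 0.02

bad_replies = population * p_bad_experience * reply_if_bad       # about 3000 calls
ok_replies = population * (1 - p_bad_experience) * reply_if_ok   # about 1400 calls
observed = bad_replies / (bad_replies + ok_replies)
print(round(observed, 2))    # 0.68 -- more than double the true 0.30
```

No increase in the number of calls fixes this: the distortion comes from who chooses to respond, not from how many do.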
Manufacturers and advertising agencies often use interviews at shopping malls to gather information about the habits of consumers and the effectiveness of ads. A sample of mall shoppers is fast and cheap. "Mall interviewing is being propelled primarily as a budget issue", one expert told the New York Times. But people contacted at shopping malls are not representative of the entire population. They are richer, for example, and more likely to be teenagers or retired. Moreover, mall interviewers tend to select neat, safe-looking individuals from the stream of customers. Decisions based on mall interviews may not reflect the preferences of all consumers.

In 1991 it was claimed that data showed that right-handed persons live, on average, almost a decade longer than left-handed or ambidextrous persons. The investigators had compared the mean ages at death of people classified as left-, right- or mixed-handed.
• What is the problem?

The questionnaire: Poorly designed questionnaires with mistakes in wording, content or layout may make it difficult to record accurate answers. The most effective methods of designing a questionnaire are discussed in Section 2.4. Following those principles will help reduce the non-sampling error associated with the questionnaire.

Interviewers: If an interviewer is used to administer the survey, their work has the potential to produce non-sampling error due to the personal characteristics of the interviewer. For example, an elderly person will often be more comfortable giving information to a female interviewer. The interviewer's opinions and characteristics may also influence the respondent's answers.

In 1968, one year after a major racial disturbance in Detroit, a sample of black residents was asked: "Do you personally feel that you can trust most white people, some white people, or none at all?"
Of those interviewed by whites, 35% answered "Most", while only 7% of those interviewed by blacks gave this answer. Many questions were asked in this study, and only on some topics, particularly black–white trust or hostility, did the race of the interviewer have a strong effect on the answers given. The interviewer was a large source of non-sampling error in this study.

Respondents: Respondents can also be a source of non-sampling error. They may refuse to answer questions, or provide inaccurate information to protect themselves. They may have memory lapses and/or lack the motivation to answer the questionnaire, particularly if it is lengthy, overly complicated or of a sensitive nature. Respondent fatigue is a very important factor.

Social desirability bias refers to the effect where respondents provide answers which they think are more acceptable, or which they think the interviewer wants to hear. For example, respondents may state that they have a higher income than is actually the case if they feel this will increase their status.
Respondents may refuse to answer a question which they find embarrassing, or choose a response which prevents them from continuing with the questions. For example, if asked the question "Are you taking oral contraceptive pills for any reason?", and knowing that a "Yes" will lead to requests for more details, respondents who are embarrassed by the question are likely to answer "No", even if this is incorrect.

Fatigue can be a problem in surveys which require a high level of commitment from respondents. The level of accuracy and detail supplied may decrease as respondents become tired of recording all the information. Sometimes interviewer fatigue can also be a problem, particularly when the interviewers have a large number of interviews to conduct.

Processing and collection: Processing and collection errors are a further source of non-sampling error; for example, the results from the survey may be entered incorrectly. The time of year the survey is enumerated can also produce non-sampling error: if the survey is conducted in the school holidays, potential respondents with school children may be away or hard to contact.

The Shere Hite surveys

In 1987, Shere Hite published a best-selling book called Women and Love. The author distributed 100,000 questionnaires through various women's groups, asking questions about love, sex, and relations between women and men. She based her book on the 4.5% of questionnaires that were returned.
• 95% said they were unhappily married
• 91% of those who were divorced said that they had initiated the divorce

What are the problems with this research?

Exercise 1: In Case 2, it was necessary to study the 'referral' pattern for palliative care providers: how many patients they send to hospital (for inpatient or outpatient treatment); how many they refer to consultants for specialist comment; how many to community health programs; and so on.
Two alternative sampling schemes are available: sample a group of palliative care practitioners and study their referral patterns; or sample a group of palliative care patients and study their referral patterns. Discuss the possible advantages and disadvantages of the two schemes.

2.4 Questionnaire design

2.4.1 Introduction

The purpose of a questionnaire is to obtain specific information with tolerable accuracy and completeness. Before the questionnaire is designed, the collection objectives should be defined. These include:
• clarifying the objectives of the survey
• determining who is to be interviewed
• defining the content
• justifying the content
• prioritizing the data to be collected (this makes it easier to discard items if the survey, once developed, is too lengthy).

Careful consideration should be given to the content, wording and format of the questionnaire, as one of the largest sources of non-sampling error is poor questionnaire design. This error can be minimized by considering the objectives of the survey and the required output, and then devising a list of questions that will accurately obtain the information required.

2.4.2 Content of the questionnaire

Relevant questions

It is important to ask only questions that are directly related to the objectives of the survey, as a means of minimizing the burden placed on respondents. Recognize the concept of a fatigue point (the point at which respondents can no longer be bothered answering questions) and design the questionnaire so that the respondent has finished the form before this point is reached. Towards the end of long questionnaires, respondents may give less thought to their answers and concentrate less on the instructions and questions, decreasing the accuracy of the information they provide. Very long questionnaires can also lead the respondent to refuse to complete the questionnaire. Hence it is necessary to ensure that only relevant questions are asked.

Reliable questions

It is important to include questions that can be easily answered. This can be achieved by adhering to the following techniques.

Appropriate recall: If information is requested by recall, the events should be sufficiently recent or familiar to respondents. People tend to remember what they should have done, have selective memories, and tend to pull events from outside the reference period into it.
Minimizing the need for recall improves the accuracy of responses.

Common reference periods: To make it easier for the respondent to answer, use reference periods which match those of the respondent's records.

Results justify efforts: The effort a respondent must make to obtain the data must be worth it. It is reasonable to accept a respondent's estimate when calculating the exact figures would make little difference to the outcome.

Filtering: Respondents should not be asked questions they cannot answer. Filter questions should be used to exclude respondents from irrelevant questions.
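Filter questions are essentially routing rules: each (question, answer) pair determines where the respondent goes next. A minimal sketch of how a web survey might implement this; the question identifiers and routing table here are invented for illustration:

```python
# Minimal sketch of filter-question routing in a web survey.
# Question identifiers and the routing table are invented for illustration.
ROUTING = {
    ("q9_speaks_english_at_home", "yes"): "q34",
    ("q9_speaks_english_at_home", "no"): "q10",
}

def next_question(current, answer):
    """Return the next question id; questions with no rule fall through in sequence."""
    return ROUTING.get((current, answer), "next_in_sequence")

print(next_question("q9_speaks_english_at_home", "no"))   # q10
print(next_question("q2_age_group", "b"))                 # next_in_sequence
```

A paper questionnaire expresses the same table as printed "Go to Q…" directions beside each answer; the web version simply enforces it automatically.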
2.4.3 Types of questions

Factual questions: These questions seek information rather than opinion. For example, respondents could be asked about behaviour patterns (e.g., "When did you last visit a General Practitioner?").

Classification or demographic questions: These are used to gain a profile of the population that has been surveyed, and provide important data for analysis.

Opinion questions: These questions seek opinions rather than facts. There are many problems associated with opinion questions:
• a respondent may not have an opinion or attitude towards the subject, so the response may be provided without much thought;
• opinion questions are very sensitive to changes in wording;
• it is impossible to check the validity of responses to opinion questions.

Hypothetical questions: The "What would you do if . . . ?" type of question. The problems with these questions are similar to those of opinion questions: you can never be certain how valid any answer to a hypothetical question is likely to be.

2.4.4 Answer formats

Questions can generally be classified as one of two types, open or closed, depending on the amount of freedom allowed in answering. When deciding which type of question to use, consider the kind of information sought, the ease of processing the responses, and the available time, money and personnel.

Open questions

Open questions allow respondents to answer in their own words. They permit any possible answer, and can collect exact values from a wide range of possibilities. Hence, open questions are used when the list of responses is very long and not obvious.

The major disadvantage of open questions is that they are far more demanding than closed questions, both to answer and to process. These questions are most commonly used where a wide range of responses is expected.
Also, the answers to these questions depend on the respondent's ability to write or speak as much as on their knowledge: two respondents might have the same knowledge and opinions, but their answers may seem different because of their varying abilities.
Examples of question formats:

Open-ended:
  Which country makes the best cars?
  ...............................................

Multiple choice:
  Which country makes the best cars?
  1. USA  2. Germany  3. Japan

Partially closed:
  Which country makes the best cars?
  1. USA  2. Germany  3. Japan  4. Other (please specify)

Checklist:
  From the list provided, indicate which brand/s of cars you have owned.
  1. Ford  2. Toyota  3. BMW

Likert scale (opinion):
  I believe Japanese cars are less reliable than European cars.
  Strongly agree (1)  Agree (2)  No opinion (3)  Disagree (4)  Strongly disagree (5)

Closed questions

Closed questions ask the respondents to choose an answer from the alternatives provided. They should be used when the full range of responses is known. Closed questions are far easier to process than open questions; their main disadvantage is that the reasons behind a particular selection cannot be determined.

There are several types of closed questions.
• Limited choice questions require the respondent to choose one of two mutually exclusive answers, for example yes/no.
• Multiple choice questions require the respondent to choose one of a number of responses provided.
• Checklist questions allow a respondent to choose more than one of the responses provided.
• Partially closed questions provide a list of alternatives where the last alternative is "Other, please specify". These are useful when it is difficult to list all possible choices.
• Opinion (Likert) scale questions seek to locate a respondent's opinion on a rating scale with a limited number of points. For example, a five-point scale would ask the respondent whether they strongly agree/agree/are neutral/disagree/strongly disagree with a particular statement of opinion, whereas a three-point scale would only measure whether they agree, disagree or are neutral. Opinion scales of this sort are called Likert scales. Five-point scales are best because:
  –
  –
  –

Response categories

When questions have categories provided, it is important that every response is catered for.

Number of categories: The quality of the data can be affected if there are too few categories, as the respondent may have difficulty finding one which accurately describes their situation; the same is true if there are too many.

Don't know: A "Don't know" category can be included so that respondents are not forced to state decisions or attitudes they would not normally hold. Excluding the option is not usually good; however, the effect of including it is hard to predict. Whether or not to include a "Don't know" option depends, to a large extent, on the subject matter.

  I was gratified to be able to answer promptly, and I did. I said I didn't know.
  Mark Twain, Life on the Mississippi

2.4.5 Wording of questions

Language

Questions which employ complex or technical language or jargon can confuse or irritate respondents. Respondents who do not understand a question may be unwilling to appear ignorant by asking the interviewer to explain it, or, if no interviewer is present, may not answer or may answer incorrectly.

Ambiguity

If ambiguous words or phrases are included in a question, the meaning may be interpreted differently by different people. This will introduce errors into the data, since different respondents will effectively be answering different questions. For example: "Why did you fly to New Zealand on Qantas airlines?" Most might interpret this question as intended, but it contains three possible questions, so the response might concern any of these:
• I flew (rather than another mode of travel) because . . .
• I went to New Zealand because . . .
• I selected Qantas because . . .
Double-barreled questions

When one question contains two concepts, it is known as a double-barreled question. For example, "How often do you go grocery shopping and do you enjoy it?". Each concept may require a different answer, or one concept may not be relevant, so respondents may be unsure how to respond, and interpreting the answers to such questions is almost impossible. Double-barreled questions should be split into two or more separate questions.

Leading questions

Questions which lead respondents to answers can introduce error. For example, the question "How many days did you work last week?", if asked without first determining whether respondents did in fact work in the previous week, is a leading question: it implies that the person would have been at work. Respondents may answer incorrectly to avoid telling the interviewer that they were not working.

Unbalanced questions

"Are you in favour of euthanasia?" is an unbalanced question because it provides only one alternative. It can be reworded to "Do you favour or not favour euthanasia?", to give respondents more than one alternative. Similarly, the use of a persuasive tone can affect the respondent's answers. Wording should be chosen carefully to avoid a tone that may bias responses.

Recall/memory error

Respondents tend to remember what should have been done rather than what was done. The quality of data collected from recall questions is influenced by the importance of the event to the respondent and the length of time since the event took place. Subjects of greater interest or importance to the respondent, or events which happen infrequently, will be remembered over longer periods and more accurately. Minimizing the recall period also helps to reduce memory bias.

Telescoping is a specific type of memory error, which occurs when the respondent reports events as occurring either earlier or later than they actually occurred.
Errors occur when respondents include details of an event which actually occurred outside the specified reference period.

Sensitive questions

Questions on topics which respondents may see as embarrassing or highly sensitive can produce inaccurate answers. If respondents are required to answer questions with information that might seem socially undesirable, they may provide the interviewer with responses they believe are more 'acceptable'. If such questions are placed at the beginning of the questionnaire, they can lead to non-response if respondents are unwilling to continue with the remaining questions.

For example: "Approximately how many cans of beer do you consume each week, on average?"

1. None
2. 1–3 cans
3. 4–6 cans
4. More than 6

A respondent might choose response 2 or 3 rather than admit to consuming the greatest quantity on the scale. Consider extending the range of choices far beyond what is expected; the respondent can then select an answer closer to the middle and feel within the normal range.

In 1980, the New York Times/CBS News Poll asked a random sample of Americans about abortion. When asked "Do you think there should be an amendment to the Constitution prohibiting abortions, or should not there be such an amendment?", 29% were in favour and 62% were opposed; the rest of the sample were uncertain. The same people were later asked a different question: "Do you believe there should be an amendment to the Constitution protecting the life of the unborn child, or should not there be such an amendment?" Now 50% were in favour and only 39% were opposed.

Acquiescence

This situation arises when there is a long series of questions for which respondents answer with the same response category. Respondents get used to providing the same answer and may answer inaccurately.

2.4.6 Questionnaire format

Including an introduction

It can be advantageous to include an introductory statement or explanation at the beginning of a survey. The introduction may include such information as the purpose of the survey or the scope of the collection. Knowing why the information is being sought will aid respondents when answering the questions; they should be given a context in which to frame their answers. An assurance of confidentiality will give respondents confidence that the results will not be obtained by unwanted parties.

Question and page numbers

To ensure that the questionnaire can be easily administered by interviewer or respondent, the pages of the questionnaire and the questions should be numbered consecutively with a simple numbering system. Question numbers provide sign-posts along the way.
They help if remedial action is required later and you want to refer the interviewer or respondent back to a particular place.

Sequencing

The questions in a questionnaire should follow a logical order, flowing smoothly from one question to the next. The questionnaire layout should have the following characteristics.
Related questions grouped: Questions which are related should be grouped together and, where necessary, placed into sections. Sections should contain an introductory heading or statement. If possible, question ordering should try to anticipate the order in which respondents will supply information. It is a sign of good survey design if a question not only prompts an answer but also prompts the answer to a question following shortly after.

Question ordering: Be aware that earlier questions can influence the responses to later questions, so the order of questions should be carefully decided. In attitudinal questions, it is important to avoid conditioning respondents with an early question which could then bias their responses to later questions. For example, you should ask about awareness of a concept before any other mention of that concept.

Respondent motivation: Whenever possible, start the questionnaire with easy and pleasant questions to promote interest in the survey and give the respondent confidence in their ability to complete it. The opening questions should ensure that the particular respondent is a member of the survey population.

Questions that are perceived as irritating or obtrusive tend to get a low response rate and may effectively trigger a refusal from the respondent. These questions need to be carefully positioned in the questionnaire where they are least likely to be sensitive.

It is also important that respondents are only asked relevant questions; otherwise they may become annoyed and lose interest. Include filter questions to direct respondents to skip questions which do not apply to them. Filter questions often identify sub-populations. For example:

  "Do you usually speak English at home?"   Yes (Go to Q34)   No (Go to Q10)

Questionnaire layout

The questionnaire layout should be aesthetically pleasing, so that the layout does not contribute to respondent fatigue.
Things that can interfere with the answering of a questionnaire include: unclear instructions and questions, insufficient space to provide answers, hard-to-read text, difficulty in understanding the language, and back-tracking through the form. Many of these problems are simply bad form design and are avoidable.

Only include essentials on the questionnaire form. Keep the amount of ink on the form to the minimum necessary for the form to work properly. Anything that is not necessary brings the respondent closer to the point of fatigue, to the detriment of data quality.
General layout

Consistency of layout: If consistency and logical patterns are introduced into the form design, the form filler's task is easier. Useful patterns include:
• white space for responses
• using the same question type throughout the form
• using the same layout throughout the form
• using a different style, consistently, for instructions or directions.

Type size: A font size between 10 and 12 points is best in most circumstances. If the respondent does not have perfect vision, or ideal working conditions, small fonts can cause problems.

Use of all upper-case text: It is best to avoid upper-case text. Upper-case text has been shown to be hard to read, especially where large amounts of text are involved: words lose their shape, becoming rectangles. Upper case should be reserved for titles or for emphasis, but even this can often be done just as well by other means, such as bold, italics, or a slightly larger type size.

Line length: As the eye has a clear focus range of only a few degrees, lines should be kept short. It takes several eye movements to scan a line of text; if more than two or three such movements are needed, the eye can become fatigued and tends to lose track of which line it is reading, leading to backtracking or misinterpretation.

Character and line spacing: It is very important to leave enough space on a form for answers. Research has shown that forms requiring handwritten responses need a distance of 7–8 mm between lines and a width of 4–5 mm for each possible character.

Response layout

Obtaining responses: A popular way of obtaining responses is tick boxes. However, it is usually preferable to use a labelled list (e.g., a, b, c, …) and ask respondents to circle their response. This makes coding and data entry easier.
If a written response is required, it is best to provide empty answer spaces, with lines made up of dots.

Positioning of responses: Vertical alignment of responses is preferred to horizontal alignment. It is easier to read up and down a list, and select the correct box, than to read across the page and locate an item in a horizontal string. Captions to the left of the answer box are easier for respondents to complete.

Order of response options: The order of response options is important because it can be a source of bias. The options presented first may be selected because they make an impact on respondents, or because respondents lose concentration and do not hear or read the remaining options. The last option may be chosen because it is most easily recalled, particularly when respondents face a long list of options. Long or complex response options can also make recall more difficult and increase the effects due to the order of response options.

Prompt card: If the questionnaire is interviewer-based, and a number of response options are given for some questions, then a prompt card may be appropriate. A prompt card is a list of possible responses to a question, displayed on a separate card which is shown by the interviewer to assist respondents. This helps to decrease error resulting from respondents being unable to remember all the options read out. However, respondents with poor eyesight, migrants with limited English, or adults with literacy problems will have difficulty answering accurately.

Exercise 2: (Case 2) The questionnaire on pages 47–48 was an early draft of the questionnaire prepared by the client. The questionnaire on pages 49–51 is a later draft after I had provided the client with some advice. See if you can determine why each of the changes was made. How could you further improve the questionnaire?

2.4.7 Pretesting the questionnaire

A pretest of a questionnaire should be considered mandatory. Although the designer will have reviewed the drafted questionnaire meticulously on all points of good design, it is still likely to contain faults. Normally, a number of these emerge only when the form is used in the field, because the researcher did not completely anticipate what would take place. The only way these faults can be fully detected is by actually administering the survey to the types of respondents who would be sampled in the study. Each type of testing is used at a different stage of survey development and aims to test different aspects of the survey.

Skirmishing
Skirmishing is the process of informally testing questionnaire design with groups of respondents.
The questionnaire is basically unstructured and is tested with a group of people who can provide feedback on issues such as each question's frame of reference, the level of knowledge needed to answer the questions, the range of likely answers, and how respondents formulate their answers. Skirmishing is also used to detect flaws or awkward wording, and to test alternative designs. At this stage we may use open-ended response categories to work out likely responses. The questionnaire should be redrafted after skirmishing.

Focus groups
A skirmish tests the questionnaire design against general respondents, whilst focus groups concentrate on a specific audience. For example, a survey studying the effects of living on unemployment benefits could use a group of unemployed people as a focus group. A focus group can also be used to test questions directed at small sub-populations. For example, if we were looking at community services we might have a filter question to target disabled people. Since there may not be many disabled people in the sample, we need to test these questions on a focus group of disabled people, which is a deliberately biased sample.
Observational studies
In an observational study, respondents complete a draft questionnaire in the presence of an observer. Whilst completing the form, the respondents explain their understanding of the questions and the method they use in providing the information. These studies can identify problem questions through observation, through questions asked by the respondents, or through the time taken to complete a particular question. Data availability, and the most appropriate person to supply the information, can also be gauged through observational studies. It is the form that is being tested, not the respondent, and this should be stressed to the respondent.

Pilot testing
Pilot testing involves formally testing a questionnaire or survey with a small representative sample of respondents. Semi-closed questions are usually used in pilot testing to gather a range of likely responses, which are then used to develop a more highly structured questionnaire with closed questions. Pilot testing is used to identify any problems associated with the form, such as questionnaire format, length and question wording, and allows comparison of alternative versions of a questionnaire.

2.5 Data processing

Data processing involves translating the answers on a questionnaire into a form that can be manipulated to produce statistics. In general, this involves coding, editing, data entry, and monitoring the whole data processing procedure. The main aim of checking the various stages of data processing is to produce a file of data that is as error-free as possible.

2.5.1 Data coding

Up to this point, the questionnaire has been considered mainly as a means of communication with the respondent. Just as important, the questionnaire is a working document for the transfer of data on to a computer file.
Consequently it is important to design the questionnaire to facilitate data entry.

Unless all the questions on a questionnaire are "closed" questions, some degree of coding is required before the survey data can be sent for data entry. The appropriate codes should be devised before the questionnaires are processed, and are usually based on the results of pretesting.

Coding consists of labelling the responses to questions (using numerical or alphabetic codes) in order to facilitate data entry and manipulation. Codes should be simple and easy to use. For example, if Question 1 has four responses, those four responses could be given the codes a, b, c, and d. The advantage of coding is the compact storage of data as a short code, compared with lengthy written descriptions, which are difficult to categorize.

Coding is relatively expensive in terms of resources, although automated techniques are continually being developed to assist with the task. Other options include self-coding, where respondents record the appropriate code themselves, or having the interviewer perform the coding task.

Before the interviewing begins, the coding frame for most questions can be devised. That is, the likely responses are obvious from previous similar surveys or thorough pilot testing, allowing those responses and the relevant codes to be printed on the questionnaire. An "Other (Please specify)" answer code is often added at the end of a question, with space for interviewers to write the answer. The standard instruction to interviewers in doubt about any precodes is to write the answer on the questionnaire in full so that it can be dealt with by a coder later.

2.5.2 Data entry

Ensure that the questionnaire is designed so that data entry personnel have minimal handling of pages. For example, all codes should be on the left (or right) hand side of the page. It is advisable to use trained data entry people to enter the data: it is quicker and more reliable, and therefore more cost-effective.

2.6 Sampling schemes

When you have a clear idea of the aims of the survey and the data requirements, the degree of accuracy required, and have considered the resources and time available, you are in a position to make a decision about the size of the sample and the form of selection of sampling units.

The two qualities most desired in a sample (besides that of providing the appropriate findings) are representativeness and stability. Sample units may be selected in a variety of ways. The sampling schemes fall into two general types: probability and non-probability methods.

2.6.1 Non-probability samples

If the probability of selection for each unit is unknown, or cannot be calculated, the sample is called a non-probability sample. For non-probability samples, since there is no control over the representativeness of the sample, it is not possible to accurately evaluate the precision of estimates (i.e., the closeness of estimates under repeated sampling of the same size).
However, where time and financial constraints make probability sampling infeasible, or where knowing the level of accuracy in the results is not an important consideration, non-probability samples do have a role to play. Non-probability samples are inexpensive and easy to run, and no frame is required. This form of sampling is popular amongst market researchers and political pollsters, as many of their surveys are based on a pre-determined sample of respondents of certain categories.

One common method of non-probability sampling is voluntary response polling. A general appeal is made (often via television) for people to contact the researcher with their opinion. Voluntary response samples are rarely useful because they over-represent people with strong opinions, most often negative opinions.
2.6.2 Probability sampling schemes

Probability sampling schemes are those in which the population elements have a known chance of being selected for inclusion in a sample. Probability sampling rigorously adheres to a precisely specified system that permits no arbitrary or biased selection. There are four main types of probability sampling schemes.

Simple random sample: If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample. This is the simplest form of probability sample to analyse.

Stratified random sample: A stratified random sample is one obtained by separating the population elements into non-overlapping groups, called strata, and then selecting a simple random sample from each stratum. This can be useful when a population is naturally divided into several groups. If the results vary greatly from stratum to stratum, it is possible to obtain more efficient estimators (and therefore more precise results) than would be possible without stratification.

Systematic sample: A sample obtained by randomly selecting one element from the first k elements in the frame and every kth element thereafter is called a 1-in-k systematic sample with a random start. This is obviously a simple method if there is a list of elements in the frame. Systematic sampling will provide better results than simple random sampling when the elements within a systematic sample are more variable than the population as a whole. This can occur when the frame is ordered.

Cluster sample: A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. The population is divided into clusters and one or more of the clusters is chosen at random and sampled.
Sometimes the entire cluster is sampled; on other occasions a simple random sample of elements within the chosen clusters is taken. Cluster sampling is usually done for administrative convenience, and is especially useful if the population has a hierarchical structure.

A comparison of these four sampling schemes appears in the table on the following page.

Example (Case 2): A few years ago, I advised the Department of Health and Community Services on a survey of palliative care patients in Victoria.
  Objective: To estimate the proportion of palliative care patients in Victorian hospitals.
  Difficulties: What is a "palliative care patient"? Proportion of what?
  Target population: Patients in acute beds at the time of the survey?
  Survey population: All patients in acute beds in Victorian hospitals except for very small (< 10 bed) country hospitals.
  Sampling scheme: Stratified (hospital types) and clustered (hospitals). Random selection of hospitals within each stratum. Total coverage of patients in the selected hospitals.
  Sample: All patients in the 18 hospitals selected out of 115 hospitals in Victoria.
Simple random sample
  How to select the sample: Assign numbers to the elements in the sampling frame. Use a random number table or random number generator to select the sample.
  Strengths/weaknesses: The basic building block. Simple, but often costly. Cannot be used unless we can assign a number to each element in the target population.

Stratified sample
  How to select the sample: Divide the population into groups that are similar within and different between on the variable of interest. Use random numbers to select the sample from each stratum.
  Strengths/weaknesses: With proper strata, can produce very accurate estimates. Less costly than simple random sampling. The target population must be stratified correctly.

Systematic sample
  How to select the sample: Select every kth element from a list after a random start.
  Strengths/weaknesses: Produces very accurate estimates when elements in the population exhibit order. Used when simple random or stratified sampling is impractical (e.g., the population size is not known). Simplifies the selection process. Do not use with periodic populations.

Cluster sample
  How to select the sample: Randomly choose clusters and sample all elements within each cluster.
  Strengths/weaknesses: With proper clusters, can produce very accurate estimates. Useful when a sampling frame is unavailable or travel costs are high. The target population must be clustered correctly.
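The four selection mechanisms above can be sketched in a few lines of code. A minimal illustration using Python's standard library, with a hypothetical frame of 100 numbered units and made-up stratum labels (the stratum names and sizes are assumptions for the example only):

```python
import random

random.seed(42)                       # reproducible illustration
N = 100
frame = list(range(N))                # sampling frame: units numbered 0..99

# Simple random sample: every possible sample of size n is equally likely.
srs = random.sample(frame, 10)

# 1-in-k systematic sample with a random start.
k = N // 10
start = random.randrange(k)           # random start among the first k elements
systematic = frame[start::k]          # every kth element thereafter

# Stratified random sample: an SRS drawn within each (hypothetical) stratum.
strata = {"metro": frame[:60], "country": frame[60:]}
stratified = [u for units in strata.values() for u in random.sample(units, 5)]

# Cluster sample: choose whole clusters at random, then take every element.
clusters = [frame[i:i + 10] for i in range(0, N, 10)]
chosen = random.sample(clusters, 2)
cluster_sample = [u for c in chosen for u in c]
```

Each scheme yields a sample of known size; what differs is which subsets of the frame are possible and with what probability.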
Exercise 3: Consider the four cases listed in the Appendix. What sampling scheme was used in each case? Why were these schemes used?

2.7 Scale development

With Likert scale data, it is common to construct a new numerical variable by summing the values of questions on a related topic (treating the answers as numerical scores from 1–5). This forms a "measure" or "scale" for the underlying "construct". More sophisticated means of deriving scales are possible. One common approach is to use factor analysis (discussed in Section 8.1).

2.7.1 Validity

A valid measure is one that measures the thing it is intended to measure.

EXAMPLE:
• A study compares job satisfaction of people over time and finds it is declining. Does that mean poor management is leading to declining job satisfaction?
• How would you construct a valid study which enables measurement of the effect of management on job satisfaction?
• How do you measure workplace harmony? Is frequency of arguments a valid measure?
• Are the results of a study in your company generalizable to other companies?
• How would you construct a valid study of this issue which applies to other companies?

2.7.2 Reliability

A reliable measure is one that gives the same 'reading' when used on repeated occasions.
• A measure is reliable but not valid if it is consistently wrong, e.g., a survey on alcohol intake.
• A measure is valid but unreliable if it sometimes measures the thing of interest, but not always, e.g., a survey on sexual experience.
Appendix B: Case studies

Case 1: Saulwick Poll
This appeared in The Age, 1 January 1990.
Case 2: Palliative care patients survey
This survey was designed to estimate the number of palliative care patients in Victorian hospitals. A palliative care patient was defined as a patient who was terminally ill and whose life expectancy was less than 6 months. The Department of Health and Community Services did not know how many patients were in this category, but a previous survey in another state indicated the proportion might be about 12%. The hospitals in Victoria were divided into eight groups: metropolitan teaching, metropolitan large non-teaching, metropolitan small non-teaching, country base, large country, small country, metropolitan extended care, and country extended care. These eight hospital types included 115 Victorian public hospitals.

Within each group of hospitals, one or more were selected at random for the sample. Eighteen hospitals in total were sampled. For each hospital surveyed, the number of palliative care patients was recorded. From this information, the proportion of hospital patients in Victoria who could be classified as "palliative care" patients was estimated. The final estimated proportion was about 4.5%.
[Pages 47–51 contained the two draft questionnaires for Case 2, referred to in Exercise 2; they are not reproduced here.]
Case 3: Survey of frequency of needlestick injuries
This survey was conducted by a company that had designed and marketed health safety products, including needle protectors. As part of their marketing, they were interested in the frequency and severity of needlestick injuries amongst health workers. The survey was conducted in seven Australian cities over a one-week period. The sample consisted of 56 staff members of the Red Cross Transfusion Services and 136 nursing staff in 25 Australian haemodialysis units. All staff who worked during the survey week were included in the sample. Each filled in a questionnaire.

Case 4: Church "Life Survey" of members' opinions
The Catholic Church Life Survey is a collection of 25 separate questionnaires designed to collect information about the opinions and characteristics of the Catholic church's clergy and membership. Each diocese in Australia was surveyed. Within each diocese there are both urban and rural parishes; a sample of urban parishes and a sample of rural parishes were surveyed within each diocese. For those parishes surveyed, a random sample consisting of 2/3 of the members who attended on the day of the survey completed the main questionnaire.
CHAPTER 3. Data summary

Recall: Types of data
The ways of organizing, displaying and analysing data depend on the type of data being investigated.
• Categorical data (also called nominal or qualitative), e.g., sex, race, type of business, postcode. Averages don't make sense. Ordered categories are called ordinal data.
• Numerical data (also called scale, interval and ratio), e.g., income, test score, age, weight, temperature, time. Averages make sense.
Note that we sometimes treat numerical data as categories (e.g., three age groups).
Part 3. Data summary

3.1 Summarising categorical data

3.1.1 Percentages and frequency tables

Example: Causes of death. Deaths in 1979 for 20–25 year olds in Australia.

  Cause                     Males   Females   Totals   Percentage
  Motor vehicle accidents     540       132      672        47.5%
  All other accidents         197        43      240        16.9%
  Suicide                     149        48      197        13.9%
  Diseases                     78        52      130         9.2%
  Neoplasms                    48        36       84         5.9%
  All other causes             56        37       93         6.6%
  Totals                     1068       348     1416       100.0%

This is a contingency table, frequency table, or two-way table.

3.1.2 Bar charts

Pie chart: shows the proportion of observations in each category by the angle of each segment; quite poor at communicating the information.
Bar chart: shows the number of observations in each category by the length of each bar. Much easier to see differences.

For one categorical variable, use a bar chart:
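The percentage column can be reproduced directly from the totals column. A minimal sketch in Python, using the figures in the table above:

```python
# Causes of death, 20-25 year olds, Australia, 1979 (totals column above).
totals = {
    "Motor vehicle accidents": 672,
    "All other accidents": 240,
    "Suicide": 197,
    "Diseases": 130,
    "Neoplasms": 84,
    "All other causes": 93,
}

grand_total = sum(totals.values())                     # 1416 deaths
percentages = {cause: round(100 * n / grand_total, 1)  # % of all deaths
               for cause, n in totals.items()}
```

The rounded percentages match the table (47.5%, 16.9%, 13.9%, 9.2%, 5.9%, 6.6%).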
[Figure: a pie chart and a bar chart of the same causes-of-death data (categories: motor vehicle accidents, all other accidents, suicide, diseases, neoplasms, all other causes; bar chart axis 0–40 per cent).]

• It is harder to make comparisons with the pie chart.
• It is harder to estimate percentages with the pie chart.
• Labelling is messier with the pie chart.
• The pie chart shows "parts of a whole" better.

3.1.3 Bar charts with two variables

[Figures: "Sex by cause of death" — one bar per sex (male/female), segmented by cause, counts 0–1000; "Cause of death by sex" — one bar per cause, segmented by sex, counts 0–600.]
[Figure: "Cause of death by sex" — side-by-side bars for females and males within each cause, counts 0–500.]

3.2 Summarizing numerical data

3.2.1 Percentiles

• Example: the 90th percentile is the point where 90% of the data lie below that point and 10% lie above it.
• The median is the 50th percentile. It is sometimes labelled Q2.
• The median is the middle measurement when the measurements are arranged in order. If there is an even number of measurements, it is the average of the middle two.
• The quartiles are the 25th and 75th percentiles. They are often labelled Q1 and Q3.
• The interquartile range is Q3 − Q1.

Example: Letter recognition scores. Scores from a letter recognition test conducted on 30 six-year-old girls:

  0 0 0 0 0 0 1 1 1 2 2 2 3 3 3 3 3 4 4 4 4 5 5 6 7 7 8 12 13 20

  Percentiles:  25% = 1,  50% = 3,  75% = 5.1

3.2.2 Five number summary

  Minimum   Q1   Median   Q3    Maximum
     0       1      3     5.1      20

• 25% of the data are between the minimum and Q1
• 25% of the data are between Q1 and the median
• 25% of the data are between the median and Q3
• 25% of the data are between Q3 and the maximum
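As a check, the five number summary can be computed with Python's standard library. One caveat: software packages use different interpolation rules for quartiles, which is why these notes report Q3 = 5.1 while the convention used by `statistics.quantiles` (the 'exclusive' method) gives 5.25 for the same data:

```python
import statistics

# Letter recognition scores for 30 six-year-old girls (data above).
scores = [0,0,0,0,0,0,1,1,1,2,2,2,3,3,3,3,3,
          4,4,4,4,5,5,6,7,7,8,12,13,20]

q1, q2, q3 = statistics.quantiles(scores, n=4)   # 'exclusive' method by default
five_number = (min(scores), q1, q2, q3, max(scores))
iqr = q3 - q1
```

The median and Q1 agree with the notes (3 and 1); only the Q3 convention differs, so always report which rule your software uses.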
3.2.3 Outliers

One definition of an outlier: any point which is more than 1.5(IQR) above Q3, or more than 1.5(IQR) below Q1. Don't delete outliers. Investigate them!

Example: Letter recognition scores. Q1 = 1, Q3 = 5.1, IQR = 5.1 − 1 = 4.1.
• So observations above 5.1 + 1.5(4.1) = 11.25 are outliers.
• Observations below 1 − 1.5(4.1) = −5.15 are outliers. Of course, this can't happen here.

Example: Air accidents. Number of airline accidents for 17 Asian airlines, 1985–1994. Source: Newsday (1995).

  Accidents   Airline
      0       Air India (India)
      0       Air Nippon (Japan)
      0       All Nippon (Japan)
      1       Asiana (South Korea)
      0       Cathay Pacific (Hong Kong)
      1       Garuda (Indonesia)
      5       Indian Airlines (India)
      1       Japan Airlines (Japan)
      0       Japan Air System (Japan)
      1       Korean Air Lines (South Korea)
      0       Malaysia Airlines (Malaysia)
     10       Merpati (Indonesia)
      0       Air Niugini (Papua New Guinea)
      3       Philippine Air Lines (Philippines)
      3       PIA (Pakistan)
      0       SIA (Singapore)
      1       Thai Airways (Thailand)

  Percentiles:  25% = ....   50% = ....   75% = ....
  IQR =
  Outliers:
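The 1.5 × IQR rule is easy to automate. A sketch for the letter recognition data, using the quartiles as quoted in these notes (Q1 = 1, Q3 = 5.1):

```python
# Letter recognition scores (30 girls), quartiles as reported in the notes.
scores = [0,0,0,0,0,0,1,1,1,2,2,2,3,3,3,3,3,
          4,4,4,4,5,5,6,7,7,8,12,13,20]

q1, q3 = 1.0, 5.1
iqr = q3 - q1                  # 4.1
lower = q1 - 1.5 * iqr         # -5.15 (impossible for these scores)
upper = q3 + 1.5 * iqr         # 11.25
outliers = [x for x in scores if x < lower or x > upper]
```

This flags 12, 13 and 20 as outliers, consistent with the fences computed above.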
3.2.4 Boxplots

A boxplot is a graphical representation of the five number summary. From top to bottom it shows: any outliers (plotted individually), the maximum when outliers are omitted, Q3 (the upper quartile, or 75th percentile), the median, Q1 (the lower quartile, or 25th percentile), and the minimum when outliers are omitted.

[Figure: box plots of letter recognition scores in each age/sex group (B5, G5, B6, G6, B7, G7, B8, G8, B9, G9, B10, G10), scores 0–20.]
3.2.5 Histograms

Histograms are useful for showing the shape of the distribution of a numerical variable.

[Figure: histogram of letter recognition scores; horizontal axis Score 0–20, vertical axis Number of girls 0–8.]

3.2.6 Measures of location

Average (mean)
The average is the sum of the measurements divided by the number of measurements, usually denoted by x̄. Suppose we have n observations, and let x1 denote the first observation, x2 the second, and so on up to xn. Then

  Sample mean: x̄ = (x1 + x2 + ··· + xn)/n = (1/n) ∑ xi,  summing over i = 1, …, n.

This is the most widely used measure of the centre of a data set, and it has good arithmetic properties. But it does have the drawback of being influenced by extreme values ("outliers").

Trimmed mean
The trimmed mean is the mean of the data when the smallest and largest 5% of values are omitted.
• The trimmed mean is more resistant to outliers than the mean.
• The median is the most resistant to outliers.

Example: Letter recognition scores — Mean 4.10, Trimmed mean 3.68, Median 3.
Example: Air accidents — Mean 1.53, Trimmed mean 1.07, Median 1.
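The three location measures for the letter recognition scores can be reproduced in a few lines. A sketch using only the standard library (the `trimmed_mean` helper is our own, written to drop 5% of the observations from each end, which for n = 30 means one observation per end):

```python
import statistics

scores = [0,0,0,0,0,0,1,1,1,2,2,2,3,3,3,3,3,
          4,4,4,4,5,5,6,7,7,8,12,13,20]

mean = statistics.mean(scores)          # 4.1
median = statistics.median(scores)      # 3

def trimmed_mean(data, proportion=0.05):
    """Mean after dropping the smallest and largest `proportion` of values."""
    xs = sorted(data)
    cut = int(len(xs) * proportion)     # observations trimmed from each end
    return statistics.mean(xs[cut:len(xs) - cut] if cut else xs)

tmean = trimmed_mean(scores)            # (123 - 0 - 20)/28, about 3.68
```

Dropping the single extreme value 20 pulls the trimmed mean well below the mean, illustrating its resistance to outliers.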
QUIZ

True or false?
1. The median and the average of any data set are always close together.
2. Half of a data set is always below the average.
3. With a large sample, the histogram is bound to follow the normal curve quite closely.

In a study of family incomes, 1000 observations range from $12,400 a year to $132,800 a year. By accident, the highest income gets changed to $1,328,000.
1. Does this affect the mean? If so, by how much?
2. Does this affect the median? If so, by how much?

3.2.7 Measures of spread

Range
The range is the difference between the maximum and the minimum. It is not a good measure of spread, since it generally increases as more data are collected and it is sensitive to outliers.

Interquartile range
The interquartile range (IQR) is the difference between the upper and lower quartiles: Q3 − Q1.

Example: Letter recognition scores. Range = 20 − 0 = 20; IQR = 5.1 − 1 = 4.1.
Example: Air accidents. Range = ....; IQR = ....

Variance and standard deviation
The variance is based on the deviations from the mean, i.e., the differences between the individual values and the mean of those values, represented by (xi − x̄). Obviously, if these were simply added, or averaged, we would always end up with zero; we therefore first make all the deviations positive. A simple way to do this is to square the deviations and then average them. This is known as the variance.
The variance of n observations x1, x2, …, xn is

  s² = [(x1 − x̄)² + (x2 − x̄)² + ··· + (xn − x̄)²]/(n − 1) = (1/(n − 1)) ∑ (xi − x̄)².

Note that this is not quite the average of the squared deviations from the mean: we use n − 1 instead of n, as dividing by n tends to underestimate the 'true' value. Dividing by n − 1 eliminates this problem.

The variance is in squared units, so by taking the square root of the variance we obtain a measure of dispersion in the same units of measurement as the original variable. This is called the standard deviation:

  s = √s² = √[ (1/(n − 1)) ∑ (xi − x̄)² ].

[Figure: four histograms of samples with SD = 1, SD = 5, SD = 10 and SD = 100, showing increasing spread.]

Example: Letter recognition scores.
  s² = (1/29)[(0 − 4.1)² + (0 − 4.1)² + ··· + (20 − 4.1)²] = 20.02
  s = √20.02 = 4.47.
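The worked example can be checked directly; `statistics.variance` and `statistics.stdev` use the same n − 1 divisor as the formula above:

```python
import statistics

scores = [0,0,0,0,0,0,1,1,1,2,2,2,3,3,3,3,3,
          4,4,4,4,5,5,6,7,7,8,12,13,20]

xbar = statistics.mean(scores)                           # 4.1
# Definition: sum of squared deviations, divided by n - 1.
var_by_hand = sum((x - xbar) ** 2 for x in scores) / (len(scores) - 1)

s2 = statistics.variance(scores)   # library value, same n - 1 divisor
s = statistics.stdev(scores)       # square root of the variance
```

Both routes give s² ≈ 20.02 and s ≈ 4.47, matching the worked example.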
Example: Air accidents.
  s² = (1/16)[(0 − 1.53)² + (0 − 1.53)² + ··· + (10 − 1.53)²] = 6.765
  s = √6.765 = 2.60.

• s > 0 unless all observations have the same value, in which case s = 0.
• s is not a resistant measure of spread.
• For many data sets, the standard deviation is roughly three-quarters of the IQR (for normally distributed data, SD = IQR/1.35).
• Approximately 95% of observations usually fall within 2 standard deviations of the mean.
• For small data sets (fewer than 50 points), the standard deviation is about one quarter of the range.

3.2.8 Basic statistics commands in Excel
• Mean: =AVERAGE(A1:A20)
• Standard deviation: =STDEV(A1:A20)
• Median: =MEDIAN(A1:A20)
• 75th percentile: =PERCENTILE(A1:A20,0.75)
3.3 Summarising two numerical variables

3.3.1 Scatterplots

Scatterplots are good at graphically displaying the relationship between two numerical variables. They are also useful for spotting bivariate outliers: points that are unusual as a combination, even when neither coordinate is unusual on its own.

[Figures: example scatterplots, including one illustrating a bivariate outlier.]
3.3.2 Correlation

The Pearson correlation coefficient is a measure of the strength of the linear relationship between two numerical variables. It is calculated by

  r = (1/(n − 1)) ∑ [(xi − x̄)/sx][(yi − ȳ)/sy],  summing over i = 1, …, n,

where sx is the sample standard deviation of the x observations and sy is the sample standard deviation of the y observations.

• The value of r always lies between −1 and 1.
• Positive r indicates positive association between the variables; negative r indicates negative association.
• The extreme values r = −1 and r = 1 occur only when the data lie exactly on a straight line.
• If the variables have a strong non-linear relationship, r may be small. Always plot the graph.

r²: a useful interpretation. The squared correlation, r², is the fraction of the variation in the y values that is explained by the linear relationship. High correlation does not prove causality.

[Figure: eight scatterplots illustrating correlations of −0.99, −0.75, −0.5, −0.25, 0.99, 0.75, 0.5 and 0.25.]
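The formula translates directly into code. A sketch with a small made-up data set (the x and y values below are illustrative only, not from the notes):

```python
import statistics

# Hypothetical paired observations, for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Pearson correlation: average of the standardized cross-products (n - 1 divisor).
r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)

r_squared = r ** 2    # fraction of variation in y explained by the line
```

For these values r ≈ 0.77, so about 60% of the variation in y is explained by the linear relationship. (Python 3.10+ also provides `statistics.correlation(x, y)` as a built-in.)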
[Figure: four scatterplots (Anscombe's quartet) with strikingly different patterns. In every case: n = 11, x̄ = 9.0, ȳ = 7.5 and r = 0.82!]

3.3.3 Spearman's rank correlation
• The same as ordinary correlation, but applied to ranks or ordinal data.
• Often used with Likert scales.
• The main difference is in the computation of p-values.

3.4 Measures of reliability

In many questionnaires, there are several questions designed to measure the same thing (sometimes called a "construct"). The answers to the questions are often added together to provide an overall "scale" which gives a single measure of the construct.

In these circumstances, it is useful to judge how closely the results from the questions are related to each other. This is called "internal consistency reliability".

For example, suppose we constructed a questionnaire to measure people's level of job satisfaction. We could provide several statements and ask respondents to answer on a 5-point Likert scale (1 = strongly agree, 2 = agree, 3 = neutral, 4 = disagree, 5 = strongly disagree):
1. I look forward to going to work each day.
2. I feel I am engaged in work which is useful to my employer.
3. Staff morale in my workplace is generally high.
4. My employer treats me well.
Internal consistency reliability involves seeing how closely the answers to these questions (or "items") are related. There are a range of internal consistency measures that can be used.

Average inter-item correlation

We can look at the correlation between any pair of items which are supposed to be measuring the construct. The average inter-item correlation is the average of all correlations between the pairs of items.

Split-half reliability

Here we randomly divide all items that are intended to measure the construct into two sets. The total score for each set of items is then computed for each person. The split-half reliability is the correlation between these two total scores.

Cronbach's alpha

Cronbach's alpha is the average of all split-half estimates. That is, if we computed all possible split-half reliabilities (by computing it on all possible divisions of items), and averaged the results, we would have Cronbach's alpha.

In practice, there is a quicker way to compute it than actually doing all these split-half estimates. Suppose there are k items, let s_i be the standard deviation of the answers to the ith item, and let s be the standard deviation of the totals formed by summing all the items for each person. Then Cronbach's alpha can be calculated as follows:

    α = k/(k−1) × ( 1 − (1/s²) sum_{i=1}^{k} s_i² ).

How large is good enough? Some books suggest that α > 0.7 is necessary to have a reliable scale. I think this is an arbitrary figure, but it gives you some idea of what is expected.
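The shortcut formula for α is easy to compute from raw item scores. A minimal Python sketch (Python is not one of the packages discussed in these notes); the sanity check uses the fact that k identical items are perfectly consistent, so α = 1.

```python
from math import sqrt

def sd(values):
    """Sample standard deviation (denominator n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def cronbach_alpha(items):
    """items: k lists, one per question, each of length n (one entry per person).
    alpha = k/(k-1) * (1 - sum(s_i^2)/s^2), with s the sd of the row totals."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]  # each person's summed score
    item_var = sum(sd(item) ** 2 for item in items)
    return k / (k - 1) * (1 - item_var / sd(totals) ** 2)

# Sanity check: three identical items must give alpha = 1.
q = [1, 2, 3, 4, 5, 3, 2, 4]
print(cronbach_alpha([q, list(q), list(q)]))  # ≈ 1.0
```

Feeding in the Q1–Q4 columns from the worked examples on the next page should reproduce the reported alphas.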
Example 1

Data (20 respondents; items Q1–Q4):

     Q1 Q2 Q3 Q4
 1    5  2  2  5
 2    2  3  1  1
 3    2  3  3  1
 4    3  3  5  1
 5    2  1  2  2
 6    5  5  2  5
 7    2  1  2  2
 8    3  2  2  1
 9    3  3  5  1
10    1  2  5  2
11    1  3  3  1
12    2  2  4  2
13    3  3  5  3
14    1  1  1  3
15    3  2  4  4
16    5  5  3  5
17    1  1  1  1
18    1  4  1  2
19    1  3  1  1
20    4  5  4  3

Correlation matrix:

       Q1    Q2    Q3    Q4
Q1  1.000 0.521 0.250 0.726
Q2  0.521 1.000 0.182 0.328
Q3  0.250 0.182 1.000 0.029
Q4  0.726 0.328 0.029 1.000

Average inter-item correlation: 0.339
Cronbach's alpha: 0.664

Example 2

Data (20 respondents; items Q1–Q4):

     Q1 Q2 Q3 Q4
 1    1  1  1  1
 2    2  3  2  2
 3    3  2  4  3
 4    5  5  5  3
 5    1  2  1  1
 6    4  4  2  1
 7    4  3  5  4
 8    4  2  3  5
 9    5  5  5  5
10    5  5  5  5
11    4  3  4  5
12    2  2  3  2
13    3  3  2  2
14    3  3  5  5
15    2  3  2  4
16    4  5  5  5
17    5  4  5  5
18    3  3  3  3
19    5  5  5  4
20    2  2  1  1

Correlation matrix:

       Q1    Q2    Q3    Q4
Q1  1.000 0.819 0.826 0.684
Q2  0.819 1.000 0.697 0.515
Q3  0.826 0.697 1.000 0.813
Q4  0.684 0.515 0.813 1.000

Average inter-item correlation: 0.725
Cronbach's alpha: 0.910

Example 3

Data (20 respondents; items Q1–Q4):

     Q1 Q2 Q3 Q4
 1    2  2  2  2
 2    4  4  4  4
 3    2  2  2  2
 4    1  1  1  1
 5    2  2  2  1
 6    4  3  4  4
 7    1  1  1  1
 8    5  3  5  5
 9    4  4  4  4
10    4  5  4  4
11    2  2  2  2
12    5  5  5  3
13    2  2  2  2
14    5  5  5  5
15    2  2  3  2
16    4  2  4  4
17    4  4  5  4
18    5  5  5  5
19    2  2  2  2
20    4  4  5  4

Correlation matrix:

       Q1    Q2    Q3    Q4
Q1  1.000 0.874 0.968 0.939
Q2  0.874 1.000 0.864 0.795
Q3  0.968 0.864 1.000 0.921
Q4  0.939 0.795 0.921 1.000

Average inter-item correlation: 0.894
Cronbach's alpha: 0.971
3.5 Normal distribution

Often a set of data, or some statistic calculated from the data, is assumed to follow a normal distribution. Data which are normally distributed have a histogram with a symmetric bell shape.

[Figure: the normal density curve, with the horizontal axis marked at µ−3σ, µ−2σ, µ−σ, µ, µ+σ, µ+2σ and µ+3σ.]

3.5.1 Parameters

The normal distribution is the basis of many statistical methods. It can be specified by two parameters:

1. the mean µ (which determines the centre of the bell); and
2. the standard deviation σ (which determines the spread of the bell).

If we call the variable Y, we write Y ~ N(µ, σ²). We use the probability model to draw conclusions about future observations.

Mean µ: The mean µ is the average of measurements taken from the entire population (rather than just a sample). We usually denote this by µ to distinguish it from the sample mean x̄. The sample mean is often used as an estimate of µ.

Standard deviation σ: The standard deviation σ is defined similarly. It is denoted by σ to distinguish it from the sample standard deviation, s. The sample standard deviation is often used as an estimate of σ.

How do you know if your data are normal? Many statistical methods assume the data are normal, or that the errors from a fitted model are normal. To test this assumption:

• Plot the histogram. It should look bell-shaped.
• Do a QQ plot on a computer. It should look straight.
3.5.2 Normal probability tables

Probability of an observation lying within kσ of µ:

     k      Prob.
   0.50     38.3%
   0.67     50.0%
   1.00     68.3%
   1.28     80.0%
   1.50     86.6%
   1.64     90.0%
   1.96     95.0%
   2.00     95.5%
   2.50     98.8%
   2.58     99.0%
   3.00     99.7%
   3.29     99.9%
   3.89     99.99%

Probability of an observation greater than µ + kσ (or less than µ − kσ):

     k      Prob.
   0.00     50.0%
   0.50     30.9%
   0.84     20.0%
   1.00     15.9%
   1.28     10.0%
   1.50      6.7%
   1.64      5.0%
   2.00      2.3%
   2.33      1.0%
   2.50      0.62%
   3.00      0.13%
   3.09      0.10%
   3.50      0.02%
   3.72      0.01%
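These table entries can be reproduced without printed tables. A quick sketch using `statistics.NormalDist` from Python's standard library (available from Python 3.8; Python is not one of the packages discussed in these notes):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

def within(k):
    """P(|Z| <= k): probability of lying within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

def upper_tail(k):
    """P(Z > k): probability of lying more than k standard deviations above the mean."""
    return 1 - z.cdf(k)

print(round(100 * within(1.96), 1))      # 95.0, as in the first table
print(round(100 * within(3.00), 1))      # 99.7
print(round(100 * upper_tail(2.33), 1))  # 1.0, as in the second table
```

The same two functions cover both tables: the first gives central probabilities, the second gives one tail.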
CHAPTER 4  Computing and quantitative research

4.1 Data preparation

4.1.1 Data in Excel

One case per row, one variable per column.

4.1.2 Things to watch

• Missing values are not zeros.
• Missing values are not "No".
• Keep a spare copy of your data in another location.
• Beware of using Excel for statistics.
• If you must use Excel for basic statistics, use a different spreadsheet from your main data file.
• For categorical variables, use a code (e.g., 1, 2, 3, ...).
Figure 4.1: Typical set-up of an Excel spreadsheet ready for importing to a statistics package.

4.1.3 Data cleaning

Data cleaning is identifying mistakes and anomalies in your data.

• Almost all data is filthy.
• More than 50% of my consulting time is spent cleaning data.
• Double entry
• Range checks (e.g., age)
• Range checks on subsets (e.g., age by pregnant)
• Exploratory graphics
• Look for anomalies:
  – the "out-of-range" score;
  – the 2000 degree day;
  – the 96 year old who is pregnant;
  – the person earning a negative salary.

4.2 Using a statistics package

• Almost all stats packages can read an Excel file directly with variables in columns, cases in rows.
• Check the data types and variable definitions in the statistics package:
  – categorical and numerical variables;
  – missing values.
• Choose a package that does what you want easily.
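Range checks and range checks on subsets can be automated. A minimal Python sketch (Python is not one of the packages discussed here, and the column names "age", "pregnant" and "salary" are hypothetical); the point is to flag anomalies for inspection rather than silently fix them.

```python
# Flag anomalies of the kinds listed above: out-of-range values,
# impossible combinations (a range check on a subset), negative salaries.
def check_record(rec):
    problems = []
    if not (0 <= rec["age"] <= 110):
        problems.append("age out of range")
    elif rec["pregnant"] == 1 and rec["age"] > 60:
        problems.append("implausible age for pregnancy")  # range check on a subset
    if rec["salary"] < 0:
        problems.append("negative salary")
    return problems

records = [
    {"age": 34, "pregnant": 1, "salary": 55000},
    {"age": 96, "pregnant": 1, "salary": 42000},  # the 96-year-old who is pregnant
    {"age": 28, "pregnant": 0, "salary": -1200},  # the negative salary
]
for i, rec in enumerate(records):
    for p in check_record(rec):
        print(f"record {i}: {p}")
```

Double entry and exploratory graphics catch the errors that rules like these miss.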
4.2.1 Microsoft Excel

Advantages:
• Widely available.
• Many people are already familiar with it.
• Intuitive, easy to use.
• Good for data entry.

Disadvantages:
• Too easy to enter data without structure.
• Numerical routines can be unreliable.
• Graphics are clumsy.
• Very limited statistical facilities.

Numerical accuracy in Excel

• Computation of p-values is inaccurate when close to zero or close to 1.
• Unstable algorithm for computing variance and standard deviation (e.g., problems with large numbers).
• Negative sums of squares for some ANOVA problems.
• Problems with regression where there is high collinearity.
• Pseudo-random numbers fail some standard tests of randomness.

Some of these problems have been known since at least 1994. Microsoft won't respond to any requests for fixes.

Conclusion: Don't use Excel for any extended statistical computation.

When to use Excel

• For data entry.
• For simple numerical summaries (means, standard deviations), provided the numbers are not too big.
• If you have very little statistical work to do.

Excel: Data Analysis add-in
4.2.2 SPSS

Advantages:
• Very widely used — lots of people to help.
• Most standard methods are available.
• Click-and-point interface as well as command interface.

Disadvantages:
• Few modern methods included (e.g., nonparametric smoothing).
• Lots of irrelevant output. Hard to know what's important.
• Routines used are not properly documented.
• Very difficult to produce customized analysis.
• Graphics are difficult to customize with code.

Guidelines for SPSS

• For summary stats on categorical data: use Analyze – Descriptive Statistics – Frequencies.
• For summary stats on numerical data: use Analyze – Descriptive Statistics – Explore.
• For summary stats on numerical data with a categorical explanatory variable: use Analyze – Descriptive Statistics – Explore.
• Make sure the selected method is appropriate for your data.

4.2.3 Interactive statistics packages

• Click-and-point interface.
• Easy to learn and use.
• Sometimes limits on data size.
• Tedious for repetitive tasks and repeated analyses.
• Examples: JMP, Statgraphics, Statview.

4.2.4 Large statistics packages

• Handle large data sets.
• Generally less interactive and flexible than smaller packages.
• Some customized analysis possible with programming.
• Examples: SPSS, SAS, Systat, Stata, Statistica, S-Plus.
4.2.5 Statistical programming languages

• Extremely flexible in data handling and application of methods.
• Can write your own routines to do virtually anything.
• Most useful for experienced statisticians.
• Examples: R, S-Plus, Stata.

4.2.6 Speciality packages

• Forecast Pro
• EViews (for econometric methods)
• Amos (for structural equation modelling)

4.2.7 Some more thoughts

• Think about the methodology you need first.
• A few good graphs can make an enormous difference to a paper, a talk or a thesis. Spend some time getting them right.
• My rule-of-thumb: produce a graph for every p-value.
• Packages perform statistical analyses quickly and easily. That means you can make a great many mistakes quickly and easily.
• THINK before you CLICK.
• If you are using a package which is not so widely used, check the results. Mistakes have been made.
• The best data analysis comes not from key strokes or print-outs but from spending time thinking.

4.2.8 Publication quality graphics

• Journals differ in standards.
• Excel and SPSS are adequate if you take some care and don't use the defaults:
  – reduce the size of data points;
  – remove grid lines;
  – remove or simplify legends;
  – remove coloured backgrounds;
  – fix axes and scales;
  – add meaningful titles;
  – etc.
• R, S-Plus and Systat produce excellent graphics.

4.2.9 Choosing a statistics package

For most purposes:

• Excel and SPSS will be satisfactory.
• Both are freely available at Monash.

BUT...

1. Does SPSS do what you want? (e.g., SPSS won't fit an additive model)
2. Do you require customized statistical analysis? (e.g., calculation of the variance of residuals from a smoothing spline)
3. Do you require interactive or repetitive data analysis? Using commands is worthwhile if you have repetition.
4. What are your colleagues using? They will often be the first point of help.

Packages at Monash

• SPSS is freely available to be installed on any university computer.
• Systat is freely available via a site licence. It is very similar to SPSS, with better graphics and some more modern methods, though some SPSS techniques are not available.
• Minitab is available on MRGS computers. It is also sold relatively cheaply at the computer centre and bookstore.
• SAS is extremely expensive, but some departments like it. I find it cumbersome.
• R is freeware (www.r-project.org) and extremely powerful and flexible. But you need good computing knowledge to use it.

4.3 Further reading

• Axford, R.L., Grunwald, G.K. and Hyndman, R.J. (1995) "The use of information technology in the research process". Invited chapter in Health informatics: an overview (ed. Hovenga, Kidd, Cesnik).
• Knüsel, L. (1998) On the accuracy of statistical distributions in Microsoft Excel 97. Computational Statistics and Data Analysis, 26, 375–377.
• McCullough, B.D. (1999) Assessing the reliability of statistical software. The American Statistician, 52, 358–366.
• McCullough, B.D. and Wilson, B. (1999) On the accuracy of statistical procedures in Microsoft Excel 97. Computational Statistics and Data Analysis, 31, 27–37.
• McCullough, B.D. and Wilson, B. (2002) On the accuracy of statistical procedures in Microsoft Excel 2000 and Excel XP. Computational Statistics and Data Analysis, 40, 713–721.
• McCullough, B.D. and Wilson, B.
(2005) On the accuracy of statistical procedures in Microsoft Excel 2003. Computational Statistics and Data Analysis, 49, 1244–1252.
• Sawitzki, G. (1994) Testing numerical reliability of data analysis systems. Computational Statistics and Data Analysis, 18, 269–286.
4.4 SPSS exercise

Data set

We will use data on emergency calls to the New York Auto Club (the NY equivalent of the RACV). Download the data from http://www.robhyndman.info/downloads/NYautoclub.xls and save it to your disk.

The variable Calls concerns emergency road service calls from the second half of January in 1993 and 1994. In addition, we have the following variables:

• Fhigh: the forecast highest temperature for that day;
• Flow: the forecast lowest temperature for that day;
• Rain: 1 = rain or snow forecast for that day; 0 otherwise;
• Snow: 1 = snow forecast for that day; 0 otherwise;
• Weekday: 1 = regular workday; 0 otherwise;
• Sunday: 1 = Sunday; 0 otherwise.

The idea is to use these variables to predict the number of emergency calls.

Loading data

1. Run SPSS and open the Excel file with the data.
2. Go to the "Variable view" sheet, and ensure the variables are correctly set to Scale (i.e., numerical) or Nominal (i.e., categorical).
3. For the categorical variables, give the values meaningful labels.

Data summaries

4. Calculate appropriate summary statistics for all variables.
5. Calculate appropriate summary statistics for the Calls variable separately for weekdays and weekends.
6. Calculate appropriate summary statistics for the Calls variable separately for rain-forecast days and other days.

Exploratory graphs

7. Try plotting the number of calls against each of the other variables using an appropriate plot (i.e., scatterplot or boxplot). [Go to Graphs in the menu.]
8. Are there any outliers in the data?
9. Which of the explanatory variables seem to be related to Calls?
10. Do you think the effects of some variables may be confounded with other variables?
CHAPTER 5  Significance

5.1 Proportions

Example: TV ratings survey

A survey of 400 people in Melbourne found that 45 were watching the Channel Nine CSI show on Sunday night. The estimated proportion of people watching is

    p̂ = 45/400 = 0.1125, or 11.25%.

5.1.1 Standard errors

Let's do a thought experiment. If we were able to collect additional samples of 400 people, we could calculate p̂ for each sample. Suppose we obtained 999 additional samples of 400 observations each. We would obtain a different value of p̂ each time because each sample would be random and different. We now have 1000 values of p̂, all of them different. The variability in these p̂ values tells us how accurate p̂ is for estimating p.

Of course, we can't collect additional samples. We just have one sample. But statistical theory can be used to calculate the standard deviation of these p̂ values if we were able to conduct such an experiment. The standard deviation of p̂ is called the standard error:

    s.e.(p̂) = sqrt( p(1 − p) / (n − 1) )

where n is the number of observations in our sample. (This is the standard deviation of the estimated proportions if we took many samples of size n and estimated the proportion from each sample.)

For percentages, the standard error is 100 times that for proportions. Notice that

• the standard error depends on the size of the sample but not the size of the target population (assuming the target population is very large);
• the standard error is smaller if the sample size is increased. This is to be expected: the more elements in the survey, the more you will know.

Example: TV ratings survey

The standard error of the TV rating in our sample of 400 is

    s.e.(p̂) = sqrt( p(1 − p) / (n − 1) ) ≈ sqrt( (0.1125)(0.8875) / 399 ) ≈ 0.016.

5.1.2 Confidence intervals

A confidence interval is a range of values which we can be confident includes the true value of the parameter of interest, in this case the proportion p. If we wish to construct a confidence interval for p, we take a multiple of the standard error either side of the estimate of the proportion.

An approximate 95% confidence interval for the proportion is

    p̂ ± 1.96 s.e.(p̂).

(This is an interval which we are 95% sure will contain the true proportion.)

Example: TV ratings survey

An approximate 95% confidence interval for the TV rating is

    0.1125 ± 1.96(0.016) = 0.1125 ± 0.031 ≈ [0.081, 0.144].

Notice that this interval is quite wide. If another TV show rates 12.5%, then we can't say which of the two shows actually had the bigger audience.

The 95% confidence interval of the proportion can be interpreted as the range of values that will contain the true proportion with a probability of 0.95. Thus if we calculate the confidence interval for a proportion for each of 1000 samples, we would expect that about 950 of the calculated confidence intervals would actually contain the true proportion.

Other confidence intervals besides 95% intervals can be calculated by replacing 1.96 by a different multiplying factor. The multiplying factor (1.96 in the example above) depends on the number of observations in the sample and the confidence level required.

• It only works for larger n. For small n, we need a different (and more complex) method of calculation.
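The standard error and confidence interval for the TV ratings example can be checked numerically. A Python sketch (Python is not one of the packages discussed in these notes), using the n − 1 denominator as in the formula above:

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Standard error and approximate 95% CI for a sample proportion,
    using the n - 1 denominator as in these notes."""
    se = sqrt(p_hat * (1 - p_hat) / (n - 1))
    return se, (p_hat - z * se, p_hat + z * se)

# TV ratings example: 45 viewers out of 400.
se, (lo, hi) = proportion_ci(45 / 400, 400)
print(round(se, 3))                 # 0.016
print(round(lo, 3), round(hi, 3))   # 0.081 0.144
```

The same function, applied to a rating of 0.125, gives an overlapping interval, which is why the two shows cannot be separated.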
• As the confidence level increases, so does the multiplying factor.

These factors are given by tables or calculated by computer.

Example: Couples with children

Consider an example where the fraction of all Australian married couples with children is to be estimated and a simple random sample is used. The population characteristic is the proportion of married couples with children in the target population. We denote this by p. It cannot be known without surveying the entire population.

The statistic is the proportion of married couples with children in the sample. We denote this by p̂. It is calculated from the survey data as follows:

    p̂ = (no. couples with children) / (no. couples in sample).

Then the 95% confidence interval for p is approximately

    p̂ ± 1.96 sqrt( p̂(1 − p̂) / (n − 1) )

where n = no. couples in the sample, and assuming the target population is very much bigger than the sample. That is, we are 95% confident that the true proportion of married couples with children lies within that range.

For example, if our sample proportion was p̂ = 0.72, where the sample size was n = 1000, then the 95% confidence interval is

    0.72 ± 1.96 sqrt( (0.72)(0.28) / 999 ) = 0.72 ± 0.03.

5.1.3 Margin of error

The margin of error is usually defined as half the width of a 95% confidence interval. So in the TV ratings example, the margin of error is 0.031. In the couples-with-children example, the margin of error is 0.03.

Generally, the margin of error is computed as

    m = 1.96 s.e.(p̂) = 1.96 sqrt( p̂(1 − p̂) / (n − 1) ).     (5.1)

The following table shows the margin of error for proportions for a range of sample sizes and proportions.
Sample                        Sample size (n)
proportion   100    200    400    600    750   1000   1500
(p)
0.10       0.059  0.042  0.029  0.024  0.021  0.019  0.015
0.20       0.079  0.056  0.039  0.032  0.029  0.025  0.020
0.30       0.090  0.064  0.045  0.037  0.033  0.028  0.023
0.40       0.097  0.068  0.048  0.039  0.035  0.030  0.025
0.50       0.098  0.069  0.049  0.040  0.036  0.031  0.025
0.60       0.097  0.068  0.048  0.039  0.035  0.030  0.025
0.70       0.090  0.064  0.045  0.037  0.033  0.028  0.023
0.80       0.079  0.056  0.039  0.032  0.029  0.025  0.020
0.90       0.059  0.042  0.029  0.024  0.021  0.019  0.015

Notice that the margin of error is greatest when p = 0.5.

Exercise 4: In television ratings surveys, a simple random sample of 400 households is taken and their viewing patterns recorded in detail. A television station claims a rating of 33% of the total viewing audience. Find a 95% confidence interval for this proportion.

5.1.4 Sample size calculation

Sample size calculation is most often done by first specifying what is an acceptable margin of error for a key population characteristic. If the survey aims to estimate the proportion of couples with children, the key population characteristic is p. Making n the subject of equation (5.1), we obtain

    n = 1 + 3.84 p(1 − p) / m².

Then substituting in the chosen values for m and p, we can obtain the sample size required. Again, we can 'guess' p from previous knowledge of the population such as a pilot survey or previous surveys.

Alternatively, a conservative approach is to use p = 0.5, since this results in the largest sample size. Using p = 0.5 gives the sample size

    n = 1 + 0.9604/m² ≈ 1/m².

This provides an upper bound on the required sample size. Other values of p will give smaller sample sizes. The following table gives sample sizes for different values of m and p.
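Equation (5.1) and its inverse can both be checked numerically against the two tables. A Python sketch (Python is not one of the packages discussed in these notes):

```python
from math import ceil, sqrt

def margin_of_error(p, n, z=1.96):
    """Equation (5.1): half-width of the approximate 95% CI for a proportion."""
    return z * sqrt(p * (1 - p) / (n - 1))

def sample_size(m, p=0.5, z=1.96):
    """Equation (5.1) inverted: smallest n achieving margin of error m.
    The tiny subtraction guards against floating-point noise before rounding up."""
    return ceil(1 + z * z * p * (1 - p) / m ** 2 - 1e-9)

print(round(margin_of_error(0.5, 400), 3))  # 0.049, as in the table above
print(sample_size(0.02))                    # 2402: conservative size for m = 2%
print(sample_size(0.02, p=0.10))            # 866: far fewer for a 10% proportion
```

The conservative p = 0.5 figures match the "Upper bound ≈ 1/m²" row reasonably closely (2402 vs 2500 for m = 0.02).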
Sample                 Margin of error (m)
proportion   0.005    0.01    0.02    0.05    0.10
(p)
0.10         13831    3459     866     140      36
0.20         24588    6148    1538     247      63
0.30         32271    8069    2018     324      82
0.40         36881    9221    2306     370      94
0.50         38417    9605    2402     386      98
0.60         36881    9221    2306     370      94
0.70         32271    8069    2018     324      82
0.80         24588    6148    1538     247      63
0.90         13831    3459     866     140      36
Upper bound  40000   10000    2500     400     100

Exercise 5: For television ratings surveys, what number of people would need to be surveyed for the margin of error to be 2%?

5.2 Numerical differences

Example: Change in test scores

A researcher is studying the change in stress scores over time for an in-house stress management program. 10 employees complete the test at the start of the program, and they do the test again at the end of their first 6 weeks on the program.

    Test 1   Test 2   Difference
      6.1     10.1       4.0
      6.4     11.9       5.5
      6.1      7.6       1.5
      4.4      6.9       2.5
      5.8     11.4       5.6
      7.0     10.0       3.0
      3.2      5.8       2.6
      5.5      5.3      -0.2
      8.3      9.8       1.5
      4.2      3.1      -1.1

Sample mean of differences: x̄ = 2.5
Sample sd of differences: s = 2.2

• Has there been a significant increase in score?
• Could this increase be due to chance?

5.2.1 Standard error

The standard deviation of x̄ (i.e., its standard error) is

    s.e.(x̄) = s/√n

where s is the standard deviation of the sample data and n is the number of observations in our sample. So in the example, the standard error is 2.2/√10 = 0.7. This figure is used to draw conclusions about the value of µ.
5.2.2 Confidence intervals for the mean

A confidence interval is a range of values which we can be confident includes the true value of the parameter of interest, in this case the population mean µ. If we wish to construct a confidence interval for µ, we take a multiple of the standard error either side of the estimate of the mean. For example, a 95% confidence interval for µ in this example is

    x̄ ± 2.262 s.e.(x̄).

Example: Change in test scores

That is, the 95% confidence interval is 2.488 ± 2.262(0.689) ≈ 2.488 ± 1.6 = [0.9, 4.0]. A typical computer output for this computation is given below.

    Variable     N    Mean    StDev   SE Mean      95.0 % C.I.
    Difference  10   2.4884  2.1781   0.6888   ( 0.9302  4.0465 )

The 95% confidence interval of the mean can be interpreted as the range of values that will contain the true mean with a probability of 0.95. Thus if we calculate the confidence interval for a mean for each of 1000 samples, we would expect that about 950 of the calculated confidence intervals would actually contain the true mean.

Other confidence intervals besides 95% intervals can be calculated by replacing 2.262 by a different multiplying factor. The multiplying factor (2.262 in the example above) depends on the number of observations in the sample and the confidence level required.

• As n increases, the multiplying factor decreases.
• As the confidence level increases, so does the multiplying factor.

These factors are given by tables or calculated by computer.

5.2.3 Hypothesis testing

As well as finding an interval to contain µ, it is of interest to test if µ is likely to be equal to some specified value. In the example of the change in test results, we wish to know if µ is likely to be different from zero (i.e., has there been an improvement). To answer this question, we construct two competing hypotheses about µ.

Definition: The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis.
They are denoted by H0 and H1 respectively. In this example, our two hypotheses are

    H0: µ = 0        H1: µ ≠ 0

The null hypothesis states that, on average, there is no change in test results, whereas the alternative hypothesis states that, on average, there is a change in test results.

In a hypothesis testing problem, after observing the sample the experimenter must decide either to accept H0 as true, or to reject H0 as false and decide in favour of H1.

Definition: A hypothesis test is a rule that specifies:

1. For which sample values the decision is made to accept H0 as true.
2. For which sample values H0 is rejected and H1 is accepted as true.

To make this decision we use a test statistic. That is, we calculate the value of some formula which is a function of the sample data. The value of the test statistic provides evidence for or against the null hypothesis. In the case of a test for the mean µ, the test statistic we use is

    t = x̄ / s.e.(x̄)

Example: Change in test scores

    t = x̄ / s.e.(x̄) = 2.488 / 0.689 = 3.613.

5.2.4 P-values

A p-value is the probability of randomly observing a value greater than or equal to the one observed, when the null hypothesis is true. The decision to accept or reject the null hypothesis is based on the p-value.

In this context, the p-value is the probability of observing an absolute t value greater than or equal to the one observed (3.613), if µ = 0. That's the same as the probability of observing a value of x̄ at least as far away from 0 as the x̄ value we obtained for this sample (2.488). This probability can be calculated easily using a statistical computer package.

Example: Change in test scores

    Test of mu = 0.00 vs mu not = 0.00
    Variable     N    Mean    StDev   SE Mean      T     P-Value
    Difference  10   2.4884  2.1781   0.6888    3.6127   0.0056

So if the population mean µ = 0, then the probability of obtaining a sample mean x̄ more than 2.488 away from µ is 0.0056. This is small enough to believe that the assumption of µ = 0 is incorrect.

If we obtain a 'large' p-value, then we say that data similar to that observed are likely to have occurred if the null hypothesis was true.
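The t statistic above can be recomputed from the table of differences. A Python sketch (Python is not one of the packages discussed in these notes); note the table shows the differences rounded to one decimal place, so the result only approximates the computer output's t = 3.6127.

```python
from math import sqrt

# Paired-differences t statistic for the change-in-test-scores example,
# computed from the rounded differences in the table.
diffs = [4.0, 5.5, 1.5, 2.5, 5.6, 3.0, 2.6, -0.2, 1.5, -1.1]
n = len(diffs)
mean = sum(diffs) / n
s = sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))  # sample sd
t = mean / (s / sqrt(n))                                 # t = x̄ / s.e.(x̄)
print(round(mean, 2), round(s, 2), round(t, 2))  # 2.49 2.2 3.59
```

The p-value then requires the t distribution with n − 1 = 9 degrees of freedom, which a statistics package supplies.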
Conversely, a small p-value would indicate that it
is unlikely that the null hypothesis was true (because if the null hypothesis were true, it is unlikely that such data would occur by chance). The smaller the p-value, the more unlikely the null hypothesis.

The p-value is used to define statistical significance. If the p-value is below 0.05 then we say the result is statistically significant. The choice of threshold is completely arbitrary. It is only convention that dictates the use of a 0.05 or 0.01 significance level. Instead of saying an effect is significant at the 0.05 level, quoting the actual p-value will allow the reader to make their own interpretation.

One-sided tests

A one-sided test only looks at the evidence against the null hypothesis in one direction (e.g., the mean µ is positive) and ignores the evidence against the null hypothesis in the other direction (e.g., the mean µ is negative).

The question of whether a p-value should be one- or two-sided may arise; a one-sided p-value is rarely appropriate. Even though there may be a priori evidence to suggest a one-sided effect, we can never really be sure that one treatment, say, is better than another. If we did, then there would be no need to do an experiment to determine this! Therefore, routinely use two-sided p-values.

5.2.5 Type I and type II errors

There are at least two reasons why we might get the wrong answer with a hypothesis test.

Type I error is where we accept the alternative hypothesis (reject the null hypothesis) even though it is not true. This is sometimes referred to as a false positive. The probability of a type I error is set in advance and is typically 5% (one in 20) or 1% (one in 100). This implies that one in 20 pieces of scientific research based on a hypothesis test is mistaken! We use α to denote the probability of a type I error (the size or level of the test).

Type II error is where we accept the null hypothesis (fail to detect a real effect) even though the alternative hypothesis is true. This is sometimes referred to as a false negative.
Its probability is often denoted by β.

If the chance of making a type I error is made very small, then automatically the risk of making a type II error will grow.

The power of a statistical test is 1 − β. This is the probability of accepting the alternative hypothesis when it is true. Obviously we want this to be as high as possible. However, the smaller we make α, the less power the test has.

These definitions are summarized in the following table.
                               Null hypothesis    Null hypothesis
Decision                       false              true
Reject null hypothesis         Correct            Type I error
                               Prob = 1 − β       Prob = α
Don't reject null hypothesis   Type II error      Correct
                               Prob = β           Prob = 1 − α

5.2.6 Summary of key concepts

standard error: The standard deviation of a statistic calculated from the data, such as a proportion or the mean difference.

p-value: The probability of observing a value as large as that which was observed if, in fact, there is no real change. If the p-value is small, we reject the hypothesis that there is no change. Usually, this is done when the p-value is smaller than 0.05.

95% confidence interval: An interval which contains the true mean change with probability of 95%. So if the confidence interval does not include zero, then the p-value is smaller than 0.05.
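The meaning of α can be illustrated by simulation. This is an illustrative Python sketch, not from the notes: draw many samples from a population where the null hypothesis really is true (µ = 0, σ = 1 known), run a two-sided z-test at the 5% level each time, and count how often we wrongly reject.

```python
import random

# Simulated type I error rate: with H0 true, a 5%-level test should
# reject in roughly 5% of repeated samples.
random.seed(1)
n, sims, rejections = 25, 4000, 0
for _ in range(sims):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / n ** 0.5)  # z = x̄ / (sigma / sqrt(n))
    if abs(z) > 1.96:                        # two-sided test, alpha = 0.05
        rejections += 1
print(rejections / sims)  # close to 0.05
```

Repeating the exercise with a non-zero true mean would estimate the power, 1 − β, instead.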
Quiz from Campbell and Machin

Each statement is either true or false.

1. The diastolic blood pressures (DBP) of a group of young men are normally distributed with mean 70 mmHg and standard deviation 10 mmHg. It follows that:
   (a) About 95% of the men have a DBP between 60 and 80 mmHg.
   (b) About 50% of the men have a DBP above 70 mmHg.
   (c) The distribution of DBP is not skewed.
   (d) All the DBPs must be less than 100 mmHg.
   (e) About 2.5% of the men have a DBP below 50 mmHg.

2. Following the introduction of a new treatment regime in an alcohol dependency unit, 'cure' rates improved. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p < 0.05). It follows that:
   (a) If there had been no real change in cure rates, the probability of getting this difference, or one more extreme, by chance is less than one in twenty.
   (b) The improvement in treatment outcome is clinically important.
   (c) The change in outcome could be due to a confounding factor.
   (d) The new regime cannot be worse than the old treatment.
   (e) Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.

3. As the size of a random sample increases:
   (a) The standard deviation decreases.
   (b) The standard error of the mean decreases.
   (c) The mean decreases.
   (d) The range is likely to increase.
   (e) The accuracy of the parameter estimates increases.

4. A 95% confidence interval for a mean:
   (a) Is wider than a 99% confidence interval.
   (b) In repeated samples will include the population mean 95% of the time.
   (c) Will include the sample mean with probability 1.
   (d) Is a useful way of describing the accuracy of a study.
   (e) Will include 95% of the observations of a sample.

5.
The p-value:
   (a) Is the probability that the null hypothesis is false.
   (b) Is generally large for very small studies.
   (c) Is the probability of the observed result, or one more extreme, if the null hypothesis were true.
   (d) Is one minus the type II error.
   (e) Can only take a limited number of discrete values such as 0.1, 0.05, 0.01, etc.
Bicep circumference

[Ref: Bland, J.M. and Altman, D.G. (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 307-310.]

The table below shows the circumference (cm) of the right and left bicep of 15 right-handed tennis players.

  Subject   Right    Left     Difference
  1         37.50    36.00    1.50
  2         35.75    34.50    1.25
  3         38.25    38.25    0.00
  4         40.50    40.00    0.50
  5         32.25    31.50    0.75
  6         37.50    36.75    0.75
  7         34.75    33.50    1.25
  8         35.75    34.75    1.00
  9         38.75    38.75    0.00
  10        40.25    40.00    0.25
  11        37.50    36.75    0.75
  12        35.75    35.25    0.50
  13        34.00    33.50    0.50
  14        40.00    39.25    0.75
  15        41.25    40.75    0.50
  Mean      37.32    36.63    0.683
  Stdev     2.606    2.803    0.438

Interpret the following computer output.

  Paired samples t test on LEFT vs RIGHT with 15 cases
    Mean RIGHT      = 37.317
    Mean LEFT       = 36.633
    Mean Difference =  0.683
    95.00% CI       =  0.441 to 0.926
    SD Difference   =  0.438
    t               =  6.045
    df              = 14
    Prob            =  0.000
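The output above can be reproduced from the right and left measurements alone. A minimal sketch in Python (standard library only; an illustration, not part of the original notes): a paired t test is just a one-sample t test on the within-subject differences.

```python
import math

# Bicep circumferences (cm) of the 15 right-handed tennis players above
right = [37.50, 35.75, 38.25, 40.50, 32.25, 37.50, 34.75, 35.75,
         38.75, 40.25, 37.50, 35.75, 34.00, 40.00, 41.25]
left  = [36.00, 34.50, 38.25, 40.00, 31.50, 36.75, 33.50, 34.75,
         38.75, 40.00, 36.75, 35.25, 33.50, 39.25, 40.75]

# Within-subject differences (right minus left)
diffs = [r - l for r, l in zip(right, left)]
n = len(diffs)

mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))   # t statistic on n - 1 = 14 df

print(round(mean_d, 3), round(sd_d, 3), round(t, 3))  # 0.683 0.438 6.045
```

The values agree with the package output: t = 6.045 on 14 df, so the mean right-minus-left difference of about 0.68 cm is highly significant.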
CHAPTER 6
Statistical models and regression

Regression is useful when there is a numerical response variable and one or more explanatory variables.

6.1 One numerical explanatory variable

Recall:
  Numerical summaries: correlation.
  Graphical summaries: scatterplot.

Example: Pulp shipments and price
[Ref: Makridakis, Wheelwright and Hyndman (1998) Forecasting: methods and applications, John Wiley & Sons, Chapter 5.]

Pulp shipments S_i (millions of metric tons) and world pulp price P_i (dollars per ton):

  S_i      P_i         S_i      P_i
  10.44    792.32      21.40    619.71
  11.40    868.00      23.63    645.83
  11.08    801.09      24.96    641.95
  11.70    715.87      26.58    611.97
  12.74    723.36      27.57    587.82
  14.01    748.32      30.38    518.01
  15.11    765.37      33.07    513.24
  15.26    755.32      33.81    577.41
  15.55    749.41      33.19    569.17
  16.81    713.54      35.15    516.75
  18.21    685.18      27.45    612.18
  19.42    677.31      13.96    831.04
  20.18    644.59
6.1.1 Scatterplots

'Eye-balling' the data suggests that shipments decrease as price increases. A plot of shipments against price is a good preliminary step to check that a linear relationship is appropriate.

6.1.2 Statistical model

In regression problems we are interested in how changes in one variable are related to changes in another. In the case of Shipments and Price we are concerned with how Shipments changes with Price, not how Price changes with Shipments. The explanatory variable is Price, and the response variable it predicts is Shipments.

The relationship between the explanatory variable, x, and the response variable, y, is

    y_i = a + b x_i + e_i

where a is the intercept of the line, b is the slope, and e_i is the error, or that part of the observed data which is not described by the linear relationship. e_i is assumed to be normally distributed with mean 0 and standard deviation σ.

If we can find the line that best fits the data, we can then determine what decrease in shipments is associated with a unit increase in price.

[Figure 6.1: The relationship between world pulp price and pulp shipments is negative. As the price increases, the quantity shipped decreases.]

The line of 'best' fit is found by minimizing the sum of squares of the deviations from the observed points to the line. This method is called the method of least squares: the fitted line minimizes

    \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{a} - \hat{b} x_i)^2
where â and b̂ are the estimates of a and b. Using calculus we find

    \hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\bar{x}

These calculations are done easily using a statistics package (or even a calculator).

Example: Pulp price and shipments
The regression equation is S = 71.7 - 0.075P. The negative relationship is seen in the downward slope of -0.075. That is, when the price increases by one dollar, shipments decrease, on average, by about 75 thousand metric tons. Further, this regression line can be used to predict the mean or expected Y for given X values. For example, when the price is $600 per ton, the predicted shipments are 71.7 - 0.075(600) = 26.7 million metric tons.

6.1.3 Outliers and influential observations

- Outliers: observations which produce large residuals.
- Influential observations: an observation is influential if removing it would markedly change the position of the regression line. (Influential observations are often outliers in the x variable.)
- Lurking variable: an explanatory variable which was not included in the regression but has an important effect on the response.

Points should not be removed without a good explanation of why they are different.

6.1.4 Residual plots

A useful plot for spotting outliers is the scatterplot of the residuals e_i against the explanatory variable x_i. This shows whether a straight line was appropriate. We expect to see a scatterplot resembling a horizontal band, with no values too far from the band and no patterns such as curvature or increasing spread.

Another useful plot for spotting outliers and other unwanted features is the plot of residuals against the fitted values ŷ_i. Again, we expect to see no pattern.
[Figure 6.2: Residual plot from the pulp regression (residuals against world pulp price). Here the residuals show a V-shaped pattern, indicating that a straight line relationship is not appropriate for these data.]

6.1.5 Correlation

Recall: the correlation coefficient is a measure of the strength of the linear relationship. A useful formula:

    r = \hat{b} s_x / s_y

The pulp price and shipments data have a correlation of r = -0.931, indicating a very strong negative relationship between pulp price and pulp shipped. If the pulp price increases, the quantity of pulp shipped tends on average to decrease, and vice versa.

So r² = 0.867, showing that 86.7% of the variation is explained by the regression line. The other 13.3% of the variation is random variation about the line.
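The least-squares formulas and the identity r = b̂ s_x / s_y can be verified on the pulp data with a few lines of standard-library Python (an illustration added here, not part of the original notes):

```python
import math

# Pulp shipments S (millions of metric tons) and world pulp price P ($/ton)
S = [10.44, 11.40, 11.08, 11.70, 12.74, 14.01, 15.11, 15.26, 15.55,
     16.81, 18.21, 19.42, 20.18, 21.40, 23.63, 24.96, 26.58, 27.57,
     30.38, 33.07, 33.81, 33.19, 35.15, 27.45, 13.96]
P = [792.32, 868.00, 801.09, 715.87, 723.36, 748.32, 765.37, 755.32,
     749.41, 713.54, 685.18, 677.31, 644.59, 619.71, 645.83, 641.95,
     611.97, 587.82, 518.01, 513.24, 577.41, 569.17, 516.75, 612.18,
     831.04]

n = len(S)
x_bar = sum(P) / n
y_bar = sum(S) / n

# Least-squares slope and intercept (regressing S on P)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(P, S)) / \
    sum((x - x_bar) ** 2 for x in P)
a = y_bar - b * x_bar

# Correlation via r = b * s_x / s_y
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in P) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in S) / (n - 1))
r = b * s_x / s_y
# slope ≈ -0.075, intercept ≈ 71.7, r ≈ -0.931, agreeing with the notes

# The fitted line always passes through (x_bar, y_bar)
assert abs(a + b * x_bar - y_bar) < 1e-9
```

Rounding the computed slope and intercept recovers the quoted equation S = 71.7 - 0.075P.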
Activity: Birth weights

The table below gives the values for 32 babies of x, the birth weight (oz), and y, the increase in weight between the 70th and 100th day of life, as a percentage of birth weight.

  x      y        x      y
  72     68       125    27
  112    63       126    60
  111    66       122    71
  107    72       126    88
  119    52       127    63
  92     75       86     88
  126    76       142    53
  80     118      132    50
  81     120      87     111
  84     114      123    59
  115    29       133    76
  118    42       106    72
  128    48       103    90
  128    50       118    68
  123    69       114    93
  116    59       94     91

[Figure: scatterplot of percentage increase in weight (days 70-100) against birth weight (oz).]

Computer output for fitting a regression line is given below.

The regression equation is Increase = 168 - 0.864 Weight

  Predictor    Coef       Stdev     t-ratio    p
  Constant     167.87     19.88      8.44      0.000
  Weight       -0.8643     0.1757   -4.92      0.000

  s = 17.80    R-sq = 44.7%    R-sq(adj) = 42.8%

What would the expected percentage increase in weight be for an infant whose birth weight was 94 oz?
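As a quick numerical check of the activity (added here as an illustration, not part of the notes), the fitted line from the output predicts the expected increase for any birth weight:

```python
# Fitted line from the regression output above: Increase = 167.87 - 0.8643 * Weight
intercept, slope = 167.87, -0.8643

def predicted_increase(weight_oz):
    """Expected % weight increase (days 70-100) for a given birth weight (oz)."""
    return intercept + slope * weight_oz

print(round(predicted_increase(94), 1))  # 86.6
```

So an infant with a birth weight of 94 oz is expected to gain about 86.6% of its birth weight between days 70 and 100.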
6.2 One categorical explanatory variable

Recall:
  Numerical summaries: group means, standard deviations, etc.
  Graphical summaries: side-by-side boxplots.

Example: Comparative volatility of stock exchanges
Data: returns for 30 stocks listed on NASDAQ and NYSE for 9-13 May 1994. We look at the absolute return in prices of stocks. This is a measure of volatility. For example, a market where stocks average a weekly 10% change in price (positive or negative) is more volatile than one which averages a 5% change.

Graphical summary: [side-by-side boxplots of absolute returns for NYSE and NASDAQ]

Numerical summaries:
             NASDAQ      NYSE
  Min.       0.00380     0.00260
  1st Qu.    0.01745     0.01120
  Median     0.03930     0.02480
  Mean       0.04395     0.02913
  3rd Qu.    0.05575     0.04010
  Max.       0.12240     0.08910

6.2.1 Statistical model

Our model is that each group has a different mean. So if we let y_{i,j} be the ith measurement from the jth group and μ_j be the mean of the jth group, then we can write the model as

    y_{i,j} = \mu_j + e_{i,j}

Again, we assume e_{i,j} ~ N(0, σ²); that is, all groups have the same standard deviation. We can estimate μ_j by ȳ_j, the sample mean of group j.
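The group-means model can be illustrated numerically. The sketch below (an added illustration; the figures are made up, not the actual NASDAQ/NYSE returns, which are not listed in the notes) also previews the next section: a least-squares line fitted to a 0/1 group indicator reproduces the two group means exactly.

```python
# Toy two-group example with a 0/1 group indicator
group = [0, 0, 0, 0, 1, 1, 1, 1]        # 0 = group A, 1 = group B
y     = [0.044, 0.039, 0.056, 0.041, 0.029, 0.025, 0.040, 0.011]

# Estimate each group mean mu_j by the sample mean of that group
mean_a = sum(v for g, v in zip(group, y) if g == 0) / group.count(0)
mean_b = sum(v for g, v in zip(group, y) if g == 1) / group.count(1)

# Least-squares line on the 0/1 indicator:
# intercept = mean of group 0, intercept + slope = mean of group 1
n = len(y)
x_bar = sum(group) / n
y_bar = sum(y) / n
b = sum((g - x_bar) * (v - y_bar) for g, v in zip(group, y)) / \
    sum((g - x_bar) ** 2 for g in group)
a = y_bar - b * x_bar

assert abs(a - mean_a) < 1e-9
assert abs((a + b) - mean_b) < 1e-9
```

This equivalence is exact for any group sizes: with a binary explanatory variable, the least-squares line passes through the two group means, so the slope is the difference between them.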
6.2.2 Dummy variable

If a categorical variable takes only two values (e.g., 'Yes' or 'No'), then an equivalent numerical variable can be constructed taking value 1 if yes and 0 if no. This is called a dummy variable. In this case, the problem becomes identical to the case with a numerical explanatory variable.

If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories). Then the problem is one of several numerical explanatory variables and is discussed in the next section.

6.3 Several explanatory variables

In multiple regression there is one variable to be predicted (e.g., sales), but there are two or more explanatory variables. The general form of multiple regression is

    Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k + e.

Thus if sales were the variable to be modelled, several factors such as GNP, advertising, prices, competition, R&D budget, and time could be tested for their influence on sales by using regression. If it is found that these variables do influence the level of sales, they can be used to predict future values of sales. Each of the explanatory variables (X_1, ..., X_k) is numerical, although categorical variables are easily handled in a similar way using dummy variables.

Case study: Mutual savings bank deposits

To illustrate the application of multiple regression, we will use a case study taken from Makridakis, Wheelwright and Hyndman (1998) Forecasting: methods and applications, John Wiley & Sons, Chapter 6.

These data refer to a mutual savings bank in a large metropolitan area. In 1993 there was considerable concern within the mutual savings banks because monthly changes in deposits were getting smaller and monthly changes in withdrawals were getting bigger. Thus it was of interest to develop a short-term forecasting model to forecast the changes in end-of-month (EOM) balance over the next few months.
Table 6.1 shows 60 monthly observations (February 1988 through January 1993) of end-of-month balance (in column 2). Note that there was strong growth in early 1991 and then a slowing of the growth rate from the middle of 1991. Also presented in Table 6.1 are the composite AAA bond rates (in column 3) and the rates on U.S. Government 3-4 year bonds (in column 4). It was hypothesized that these two rates had an influence on the EOM balance figures in the bank.

Of interest to the bank was the change in the end-of-month balance, so first differences of the EOM data in Table 6.1 are shown as column 2 of Table 6.2. These differences, denoted D(EOM) in subsequent equations, are plotted in Figure 6.3, and it is clear that the bank was
facing a volatile situation in the last two years or so. The challenge is to forecast these rapidly changing EOM values.

  Month  (EOM)    (AAA)  (3-4)     Month  (EOM)    (AAA)  (3-4)
  1      360.071  5.94   5.31      31     380.119  8.05   7.46
  2      361.217  6.00   5.60      32     382.288  7.94   7.09
  3      358.774  6.08   5.49      33     383.270  7.88   6.82
  4      360.271  6.17   5.80      34     387.978  7.79   6.22
  5      360.139  6.14   5.61      35     394.041  7.41   5.61
  6      362.164  6.09   5.28      36     403.423  7.18   5.48
  7      362.901  5.87   5.19      37     412.727  7.15   4.78
  8      361.878  5.84   5.18      38     423.417  7.27   4.14
  9      360.922  5.99   5.30      39     429.948  7.37   4.64
  10     361.307  6.12   5.23      40     437.821  7.54   5.52
  11     362.290  6.42   5.64      41     441.703  7.58   5.95
  12     367.382  6.48   5.62      42     446.663  7.62   6.20
  13     371.031  6.52   5.67      43     447.964  7.58   6.03
  14     373.734  6.64   5.83      44     449.118  7.48   5.60
  15     373.463  6.75   5.53      45     449.234  7.35   5.26
  16     375.518  6.73   5.76      46     454.162  7.19   4.96
  17     374.804  6.89   6.09      47     456.692  7.19   5.28
  18     375.457  6.98   6.52      48     465.117  7.11   5.37
  19     375.423  6.98   6.68      49     470.408  7.16   5.53
  20     374.365  7.10   7.07      50     475.600  7.22   5.72
  21     372.314  7.19   7.12      51     475.857  7.36   6.04
  22     373.765  7.29   7.25      52     480.259  7.34   5.66
  23     372.776  7.65   7.85      53     483.432  7.30   5.75
  24     374.134  7.75   8.02      54     488.536  7.30   5.82
  25     374.880  7.72   7.87      55     493.182  7.27   5.90
  26     376.735  7.67   7.14      56     494.242  7.30   6.11
  27     374.841  7.66   7.20      57     493.484  7.31   6.05
  28     375.622  7.89   7.59      58     498.186  7.26   5.98
  29     375.461  8.14   7.74      59     500.064  7.24   6.00
  30     377.694  8.21   7.51      60     506.684  7.25   6.24

Table 6.1: Bank data: end-of-month balance (in thousands of dollars), AAA bond rates, and rates for 3-4 year government bond issues over the period February 1988 through January 1993.

In preparation for some of the regression analyses to be done in this chapter, Table 6.2 designates D(EOM) as Y, the response variable, and shows three explanatory variables X1, X2, and X3. Variable X1 is the AAA bond rates from Table 6.1, but they are now shown leading the D(EOM) values.
Similarly, variable X2 refers to the rates on 3-4 year government bonds, and they too are shown leading the D(EOM) values by one month. Finally, variable X3 refers to the first differences of the 3-4 year government bond rates, and the timing for this variable coincides with that of the D(EOM) variable.
[Figure 6.3: (a) A time plot of the monthly change of end-of-month balances at a mutual savings bank. (b) A time plot of AAA bond rates. (c) A time plot of 3-4 year government bond issues. (d) A time plot of the monthly change in 3-4 year government bond issues. All series are shown over the period February 1988 through January 1993.]
  t      Y        X1     X2     X3        t      Y        X1     X2     X3
  Month  D(EOM)   (AAA)  (3-4)  D(3-4)    Month  D(EOM)   (AAA)  (3-4)  D(3-4)
  1       1.146   5.94   5.31    0.29     31      2.169   8.05   7.46   -0.37
  2      -2.443   6.00   5.60   -0.11     32      0.982   7.94   7.09   -0.27
  3       1.497   6.08   5.49    0.31     33      4.708   7.88   6.82   -0.60
  4      -0.132   6.17   5.80   -0.19     34      6.063   7.79   6.22   -0.61
  5       2.025   6.14   5.61   -0.33     35      9.382   7.41   5.61   -0.13
  6       0.737   6.09   5.28   -0.09     36      9.304   7.18   5.48   -0.70
  7      -1.023   5.87   5.19   -0.01     37     10.690   7.15   4.78   -0.64
  8      -0.956   5.84   5.18    0.12     38      6.531   7.27   4.14    0.50
  9       0.385   5.99   5.30   -0.07     39      7.873   7.37   4.64    0.88
  10      0.983   6.12   5.23    0.41     40      3.882   7.54   5.52    0.43
  11      5.092   6.42   5.64   -0.02     41      4.960   7.58   5.95    0.25
  12      3.649   6.48   5.62    0.05     42      1.301   7.62   6.20   -0.17
  13      2.703   6.52   5.67    0.16     43      1.154   7.58   6.03   -0.43
  14     -0.271   6.64   5.83   -0.30     44      0.116   7.48   5.60   -0.34
  15      2.055   6.75   5.53    0.23     45      4.928   7.35   5.26   -0.30
  16     -0.714   6.73   5.76    0.33     46      2.530   7.19   4.96    0.32
  17      0.653   6.89   6.09    0.43     47      8.425   7.19   5.28    0.09
  18     -0.034   6.98   6.52    0.16     48      5.291   7.11   5.37    0.16
  19     -1.058   6.98   6.68    0.39     49      5.192   7.16   5.53    0.19
  20     -2.051   7.10   7.07    0.05     50      0.257   7.22   5.72    0.32
  21      1.451   7.19   7.12    0.13     51      4.402   7.36   6.04   -0.38
  22     -0.989   7.29   7.25    0.60     52      3.173   7.34   5.66    0.09
  23      1.358   7.65   7.85    0.17     53      5.104   7.30   5.75    0.07
  24      0.746   7.75   8.02   -0.15     54      4.646   7.30   5.82    0.08
  25      1.855   7.72   7.87   -0.73     55      1.060   7.27   5.90    0.21
  26     -1.894   7.67   7.14    0.06     56     -0.758   7.30   6.11   -0.06
  27      0.781   7.66   7.20    0.39     57      4.702   7.31   6.05   -0.07
  28     -0.161   7.89   7.59    0.15     58      1.878   7.26   5.98    0.02
  29      2.233   8.14   7.74   -0.23     59      6.620   7.24   6.00    0.24
  30      2.425   8.21   7.51   -0.05

Table 6.2: Bank data: monthly changes in balance as response variable and three explanatory variables. (Data for months 54-59 to be ignored in all analyses and then used to check forecasts.)

The numbers in the first row of Table 6.2 are explained as follows:

  1.146 = (EOM balance Mar. 1988) - (EOM balance Feb. 1988)
  5.94  = AAA bond rate for Feb. 1988
  5.31  = 3-4 year government bond rate for Feb. 1988
  0.29  = (3-4 rate for Mar. 1988) - (3-4 rate for Feb. 1988)

(Note that the particular choice of these explanatory variables is not arbitrary, but is based on an extensive analysis that will not be presented in detail here.)

For the purpose of illustration in this chapter, the last six rows in Table 6.2 will be ignored in all the analyses that follow, so that they may be used to examine the accuracy of the various models to be employed. (The idea is to forecast the D(EOM) figures for periods 54-59, and
then compare them with the known figures not used in developing our regression model. This comparison is not actually carried out in these notes.)

The bank could model Y (the D(EOM) variable) on the basis of X1 alone, or on the basis of a combination of the X1, X2, and X3 variables shown in columns 3, 4, and 5. So Y, the response variable, is a function of one or more of the explanatory variables. Although several different forms of the function could be written to designate the relationships among these variables, a straightforward one that is linear and additive is

    Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + e,    (6.1)

where Y = D(EOM), X1 = AAA bond rates, X2 = 3-4 rates, X3 = D(3-4) year rates, and e = error term. From equation (6.1) it can readily be seen that if two of the X variables were omitted, the equation would be like those handled previously with simple linear regression.

Time plots of each of the variables are given in Figure 6.3. These show the four variables individually as they move through time. Notice how some of the major peaks and troughs line up, implying that the variables may be related.

Scatterplots of each combination of variables are given in Figure 6.4. These enable us to visualize the relationship between each pair of variables. Each panel shows a scatterplot of one of the four variables against another: the variable on the vertical axis is the variable named in that row, and the variable on the horizontal axis is the variable named in that column. So, for example, the panel in the top row and second column is a plot of D(EOM) against AAA. Similarly, the panel in the second row and third column is a plot of AAA against (3-4). This figure is known as a scatterplot matrix and is a very useful way of visualizing the relationships between the variables.

Note that the mirror image of each plot above the diagonal is given below the diagonal.
For example, the plot of D(EOM) against AAA given in the top row and second column is mirrored in the second row and first column with a plot of AAA against D(EOM).

Figure 6.4 shows that there is a weak linear relationship between D(EOM) and each of the other variables. It also shows that two of the explanatory variables, AAA and (3-4), are related linearly. This phenomenon is known as collinearity and means it may be difficult to distinguish the effect of AAA from that of (3-4) on D(EOM).

For the bank data in Table 6.2 (using only the first 53 rows) the model in equation (6.1) can be solved using least squares to give

    \hat{Y} = -4.34 + 3.37 X_1 - 2.83 X_2 - 1.96 X_3.    (6.2)

Note that a "hat" is used over Y to indicate that this is an estimate of Y, not the observed Y. This estimate Ŷ is based on the three explanatory variables only. The difference between the observed Y and the estimated Ŷ tells us something about the "fit" of the model, and this discrepancy is called the residual (or error):
[Figure 6.4: Scatterplots of each combination of variables. The variable on the vertical axis is the variable named in that row; the variable on the horizontal axis is the variable named in that column. This scatterplot matrix is a very useful way of visualizing the relationships between each pair of variables.]

                 Y         X1       X2       X3
                 D(EOM)    (AAA)    (3-4)    D(3-4)
  Y  = D(EOM)    1.000     0.257   -0.391   -0.195
  X1 = (AAA)     0.257     1.000    0.587   -0.204
  X2 = (3-4)    -0.391     0.587    1.000   -0.201
  X3 = D(3-4)   -0.195    -0.204   -0.201    1.000

Table 6.3: Bank data: the correlations among the response and explanatory variables.
    e_i = Y_i - \hat{Y}_i
          (observed) minus (estimated using the regression model)

Computer output

In the case of the bank data and the linear regression of D(EOM) on (AAA), (3-4), and D(3-4), the full output from a regression program included the following information:

  Term      Coeff.   Value     se of b_j   t         P-value
  Constant  b_0      -4.3391   3.2590      -1.3314   0.1892
  AAA       b_1       3.3722   0.5560       6.0649   0.0000
  (3-4)     b_2      -2.8316   0.3895      -7.2694   0.0000
  D(3-4)    b_3      -1.9648   0.8627      -2.2773   0.0272

R² = 0.53.

Residual analysis

Figure 6.5 shows four plots of the residuals after fitting the model

    D(EOM) = -4.34 + 3.37 (AAA) - 2.83 (3-4) - 1.96 D(3-4).

These plots help examine the linearity and homoscedasticity assumptions. The bottom right panel of Figure 6.5 shows the residuals (e_i) against the fitted values (Ŷ_i). The other panels show the residuals plotted against the explanatory variables. Each of the plots can be interpreted in the same way as the residual plot for simple regression. The residuals should not be related to the fitted values or the explanatory variables, so each residual plot should show scatter in a horizontal band, with no values too far from the band and no patterns such as curvature or increasing spread. All four plots in Figure 6.5 show no such patterns.

If there is any curvature in one of the plots against an explanatory variable, it suggests that the relationship between Y and that X variable is non-linear (a violation of the linearity assumption). The plot of residuals against fitted values is used to check the assumption of homoscedasticity and to identify large residuals (possible outliers). For example, if the residuals show increasing spread from left to right (i.e., as Ŷ increases), then the variance of the residuals is not constant.

It is also useful to plot the residuals against explanatory variables which were not included in the model.
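For a single observation the residual calculation can be written out directly. A small sketch (an added illustration, not part of the notes) using the full-precision coefficients from the output above and the first row of Table 6.2:

```python
# Coefficients from the regression output above (full precision)
b0, b1, b2, b3 = -4.3391, 3.3722, -2.8316, -1.9648

# First observation in Table 6.2: Y = D(EOM), X1 = AAA, X2 = (3-4), X3 = D(3-4)
y, x1, x2, x3 = 1.146, 5.94, 5.31, 0.29

y_hat = b0 + b1 * x1 + b2 * x2 + b3 * x3   # fitted value
e = y - y_hat                              # residual = observed - estimated

print(round(y_hat, 3), round(e, 3))  # 0.086 1.06
```

The model predicts a D(EOM) of about 0.09 for the first month, against an observed 1.146, giving a residual of about 1.06; this is one of the points plotted in Figure 6.5.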
If such plots show any pattern, it indicates that the variable concerned contains some valuable predictive information and should be added to the regression model.

To check the assumption of normality, we can plot a histogram of the residuals. Figure 6.6 shows such a histogram with a normal curve superimposed. The histogram shows the number of residuals obtained within each of the intervals marked on the horizontal axis. The normal
[Figure 6.5: Bank data: plots of the residuals obtained when D(EOM) is regressed against the three explanatory variables AAA, (3-4), and D(3-4). The lower right panel shows the residuals plotted against the fitted values (e_i vs Ŷ_i). The other plots show the residuals plotted against the explanatory variables (e_i vs X_{j,i}).]

[Figure 6.6: Bank data: histogram of residuals with normal curve superimposed.]

curve shows how many observations one would get on average from a normal distribution. In this case, there does not appear to be any problem with the normality assumption.

There is one residual (with value -5.6) lying away from the other values, which can be seen in the histogram (Figure 6.6) and the residual plots of Figure 6.5. However, this residual is not sufficiently far from the other values to warrant much closer attention.
6.4 Comparing regression models

Computer output for regression will always give the R² value. This is a useful summary of the model.

- It is equal to the square of the correlation between Y and Ŷ.
- It is often called the "coefficient of determination".
- It can also be calculated as follows:

      R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}    (6.3)

- It is the proportion of variance accounted for (explained) by the explanatory variables X_1, X_2, ..., X_k.

However, it needs to be used with caution. The problem is that R² does not take into account "degrees of freedom": models become more flexible as more variables are added. Consequently, adding any variable tends to increase the value of R², even if that variable is irrelevant.

To overcome this problem, an adjusted R² is defined as follows:

      \bar{R}^2 = 1 - (1 - R^2)\,\frac{\text{total df}}{\text{error df}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}

where n is the number of observations and k is the number of explanatory variables in the model. Note that R̄² is referred to as "adjusted R²" or "R-bar-squared", or sometimes as "R², corrected for degrees of freedom".

There are other measures which, like R̄², can be used to find the best regression model. Some computer programs will output several possible measures. Apart from R̄², the most commonly used measures are Mallows' Cp statistic and Akaike's AIC statistic.
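The adjustment formula is easy to compute directly. A small illustration (added here, not part of the notes) using the bank regression, where R² = 0.53 with n = 53 observations and k = 3 explanatory variables:

```python
def adjusted_r2(r2, n, k):
    """R-bar-squared: R^2 corrected for degrees of freedom
    (n observations, k explanatory variables)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Bank regression from the previous section
print(round(adjusted_r2(0.53, 53, 3), 3))  # 0.501

# Adding a fourth variable that leaves R^2 unchanged *lowers* adjusted R^2,
# which is exactly the penalty the adjustment is designed to impose
assert adjusted_r2(0.53, 53, 4) < adjusted_r2(0.53, 53, 3)
```

So the 0.53 reported for the bank model shrinks to about 0.50 once the three explanatory variables are accounted for.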
6.5 Choosing regression variables

Developing a regression model for real data is never a simple process, but some guidelines can be given. Generally, we have a long list of potential explanatory variables. The long list needs to be reduced to a short list by various means, and a certain amount of creativity is essential.

There are many proposals regarding how to select appropriate variables for a final model. Some of these are straightforward, but not recommended:

- Plot Y against a particular explanatory variable (X_j) and, if it shows no noticeable relationship, drop it.
- Look at the correlations among the explanatory variables (all of the potential candidates) and, every time a large correlation is encountered, remove one of the two variables from further consideration; otherwise you might run into multicollinearity problems (see Section 6.6).
- Do a multiple linear regression on all the explanatory variables and disregard all variables whose p-values are very large (say p > 0.2).

Although these approaches are commonly followed, none of them is reliable for finding a good regression model. Some proposals are more complicated, but more justifiable:

- Do a best subsets regression (see Section 6.5.1).
- Do a stepwise regression (see Section 6.5.2).
- Do a principal components analysis of all the variables (including Y) to decide on the key variables (see Draper and Smith, 1981).
- Do a distributed lag analysis to decide which leads and lags are most appropriate for the study at hand.

Quite often, a combination of the above will be used to reach the final short list of explanatory variables.

6.5.1 Best subsets regression

Ideally, we would like to calculate all possible regression models using our set of candidate explanatory variables and choose the best model among them. There are two problems here.
First, it may not be feasible to compute all the models because of the huge number of possible combinations of variables. Second, how do we decide which model is best?

We will consider the second problem first. A naïve approach to selecting the best model would be to find the model which gives the largest value of R². In fact, that is the model which contains all the explanatory variables! Every additional explanatory variable will result in an increase in R². Clearly not all of these explanatory variables should be included, so maximizing the value of R² is not an appropriate method for finding the best model.

Instead, we can compare the R̄² values for all the possible regression models and select the model with the highest value of R̄². If we have 44 possible explanatory variables, then we can use anywhere between 0 and 44 of these in our final model. That is a total of 2^44, or about 18 trillion, possible regression models! Even using modern computing facilities, it is impossible to compute that many regression models in a person's lifetime. So we need some other approach; the problem quickly gets out of hand without some help. To select the best explanatory variables from among 44 candidate variables, we need to use stepwise regression (discussed in the next section).

6.5.2 Stepwise regression

Stepwise regression is a method which can be used to help sort out the relevant explanatory variables from a set of candidates when the number of explanatory variables is too large to allow all possible regression models to be computed. Several types of stepwise regression are in use today. The most common is described below.

Step 1: Find the best single variable (X1*).

Step 2: Find the best pair of variables (X1* together with one of the remaining explanatory variables; call it X2*).

Step 3: Find the best triple of explanatory variables (X1* and X2* plus one of the remaining explanatory variables; call the new one X3*).

Step 4: From this step on, the procedure checks whether any of the earlier introduced variables should be removed. For example, the regression of Y on X2* and X3* alone might give a better R̄² than the regression on all three variables X1*, X2*, and X3*: at step 2 the best pair had to include X1*, but by step 3 the pair X2* and X3* could actually be superior.

Step 5: The process of (a) looking for the next best explanatory variable to include, and (b) checking whether a previously included variable should be removed, is continued until certain criteria are satisfied. For example, in running a stepwise regression program, the user is asked to enter two "tail" probabilities:
1.
the probability, P1, to "enter" a variable, and
2. the probability, P2, to "remove" a variable.

When it is no longer possible to find any new variable that contributes at the P1 level to the R̄² value, and no variable needs to be removed at the P2 level, the iterative procedure stops.

The NY Auto Club example (page 106) shows:
- Putting all the explanatory variables in can lead to a significant overall effect but with no way of determining the individual effects of each variable.
- Stepwise regression is useful for choosing only the most significant variables in the regression model.
- Stepwise regression is not guaranteed to lead to the best possible model.
- If you are trying several different models, use the adjusted R² value to select between them.

6.6 Multicollinearity

In regression analysis, multicollinearity is the name given to any one or more of the following conditions:

- Two explanatory variables are perfectly correlated.
- Two explanatory variables are highly correlated (i.e., the correlation between them is close to +1 or -1).
- A linear combination of some of the explanatory variables is highly correlated with another explanatory variable.
- A linear combination of one subset of explanatory variables is highly correlated with a linear combination of another subset of explanatory variables.

The reason for concern about this issue is first and foremost a computational one. If perfect multicollinearity exists in a regression problem, it is simply not possible to carry out the least squares (LS) solution. If nearly perfect multicollinearity exists, the LS solution can be affected by round-off error in some calculators and some computer packages. There are computational methods that are robust enough to take care of all but the most difficult multicollinearity problems, but not all packages take advantage of these methods. Excel is notoriously bad in this respect.

The other major concern is that the stability of the regression coefficients is affected by multicollinearity. As multicollinearity becomes more nearly perfect, the regression coefficients computed by standard regression programs become (a) unstable, as measured by the standard error of the coefficient, and (b) unreliable, in that different computer programs are likely to give different solution values.

Multicollinearity is not a problem unless either (i) the individual regression coefficients are of interest, or (ii) attempts are made to isolate the contribution of one explanatory variable to Y, without the influence of the other explanatory variables.
Multicollinearity will not affect the ability of the model to predict.

A common but incorrect idea is that an examination of the intercorrelations among the explanatory variables can reveal the presence or absence of multicollinearity. While it is true that a correlation very close to +1 or −1 does suggest multicollinearity, it is not valid (unless there are only two explanatory variables) to infer that multicollinearity does not exist when there are no high correlations between any pair of explanatory variables. This point will be examined in the next two sections.
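A standard diagnostic for this, not named in the text above, is the variance inflation factor (VIF): regress each explanatory variable on all the others and compute 1/(1 − R2). A common rule of thumb treats a VIF above about 10 as a sign of troublesome multicollinearity. A minimal sketch with made-up data:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column: 1 / (1 - R^2), where R^2
    comes from regressing that column on the remaining columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                    # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
```

Here the first two columns produce very large VIFs even though each pairwise correlation with x3 is unremarkable, which is exactly the situation the paragraph above warns about.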
6.7 SPSS exercises

6.7.1 Predicting NYAC calls

Recall the data on emergency calls to the New York Auto Club (the NY equivalent of the RACV). (See p.76.)
 1. Try fitting a linear regression model to predict Calls using the other variables. [Go to Analyze → Regression → Linear.] Include all possible explanatory variables and leave “Method” as “Enter”. Which variables are significant? Are the coefficients in the direction you expected? Write down the R2 and adjusted-R2 values.
 2. Now try fitting the same regression model but with “Method” set to “Stepwise”. This puts in only the variables which are useful and leaves out the others. Now which variables are included? Write down the R2 and adjusted-R2 values. How have they changed?
 3. Finally, try the model with explanatory variables Flow and Rain. Write down the R2 and adjusted-R2 values. This shows that the stepwise method in SPSS doesn't always find the best model!
 4. This last model should have the following coefficients:
        Constant   7952.7
        Flow       −173.8
        Rain       1922.2
    How can each of these be interpreted?
 5. The busiest day of all in 1994 was January 27, when the daily forecast low was 14°F and the ground was under six inches of snow. The club answered 8947 calls. Could this have been predicted from the model?
CHAPTER 7 Significance in regression

7.1 Statistical model

Our model assumes that the residuals are normally distributed with constant variance. So please check the residual plots before computing any confidence intervals or tests of significance. If this assumption is invalid, then the confidence intervals and tests are invalid.

7.2 ANOVA tables and F-tests

When a regression model is fitted, we can test whether the model is any better than having no variables at all. The test is conducted using an ANOVA (ANalysis Of VAriance) table and is called an F-test. Here the null hypothesis is that no variable has any effect (i.e., all coefficients are zero). The alternative hypothesis is that at least one variable has some effect (i.e., at least one coefficient is non-zero).

An analysis of variance splits the variation in the data into two components: the variation explained by the model and the variation left over in the residuals. If the null hypothesis is true (no variable is relevant), the model mean square should be about the same size as the residual mean square; a model mean square much larger than the residual mean square is evidence against the null hypothesis. The calculations are summarized in an “analysis of variance” or ANOVA table.

 Example: Pulp shipments and price

 Analysis of Variance
 Source          DF      SS      MS       F      P
 Regression       1  1357.2  1357.2  149.38  0.000
 Residual Error  23   209.0     9.1
 Total           24  1566.2
The Analysis of Variance (ANOVA) table above contains six columns: source of variation, degrees of freedom (DF), sums of squares (SS), mean square (MS), the variance ratio or F-value (F), and the p-value (P). Of primary interest are the F and P columns.
 • The F-value follows an F-distribution, and is used to decide if the model is significant.
 • The p-value is the probability that a randomly selected value from the F-distribution is greater than the observed variance ratio.
 • As a general rule, if the F-probability (or p-value) is less than 0.05 then the model is deemed to be significant. In this case, there is a significant effect due to the included variable.
 • If there are two groups, the p-value from the ANOVA (F-test) is the same as the p-value from the t-test (provided a t-test with “pooled variance” is used).

7.3 t-tests and confidence intervals for coefficients

The regression equation gives us the line best relating shipments to price. We now ask a statistical question: is the relationship significant? In the context of linear regression, a relationship is significant if the slope of the line is significantly different from zero, since a slope equal to zero would imply that shipments remain unchanged as price increases, i.e., no relationship.

To test the significance of the relationship between shipments and price, the hypotheses are:
    H0: b = 0
    H1: b ≠ 0.
As usual, if the p-value is less than 0.05 then the linear regression is deemed to be significant. This means that the estimated slope of the line is significantly different from zero.

 Example: Pulp shipments and price

 The regression equation is
 Shipments = 71.7 - 0.0751 Price

 Predictor      Coef      StDev       T      P
 Constant     71.668      4.195   17.08  0.000
 Price     -0.075135   0.006147  -12.22  0.000

 S = 3.014   R-Sq = 86.7%   R-Sq(adj) = 86.1%

 The slope is estimated to be −0.075 with a standard error of 0.006. The estimated slope divided by the standard error gives the t statistic.
The p-value is 0.000, identical to the p-value from the ANOVA, as it should be.
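Both the F-test and the slope t-test can be checked directly from the printed output. The sketch below assumes scipy is available; the inputs are copied from the tables above, so the two statistics agree only up to the rounding in the published output.

```python
from scipy import stats

# Pulp shipments example: numbers from the ANOVA and coefficient tables.
ss_regression, df_regression = 1357.2, 1
ss_residual, df_residual = 209.0, 23

# F is the ratio of the two mean squares.
F = (ss_regression / df_regression) / (ss_residual / df_residual)   # ~149
p_F = stats.f.sf(F, df_regression, df_residual)                     # upper tail

# t is the estimated slope divided by its standard error.
coef, stderr = -0.075135, 0.006147
t = coef / stderr                                                   # ~ -12.22
p_t = 2 * stats.t.sf(abs(t), df_residual)                           # two-sided

# With a single explanatory variable, t**2 equals the ANOVA F statistic.
```

Both p-values come out far below 0.001, which SPSS and Minitab round to the “0.000” shown above.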
Activity: Birth weights again

x = birth weight
y = increase in weight between the 70th and 100th day of life, as a percentage of birth weight.

[Scatterplot: percentage increase in weight (day 70–100) against birth weight (oz).]

Computer output for fitting a regression line is given below.

The regression equation is
Increase = 168 - 0.864 Weight

Predictor     Coef    Stdev  t-ratio      p
Constant    167.87    19.88     8.44  0.000
Weight     -0.8643   0.1757    -4.92  0.000

s = 17.80   R-sq = 44.7%   R-sq(adj) = 42.8%

Analysis of Variance
SOURCE      DF       SS      MS      F      p
Regression   1   7666.4  7666.4  24.20  0.000
Error       30   9502.1   316.7
Total       31  17168.5

 1. Is there an association between birth weight and % weight increase in the 70th to 100th day?
 2. Is the constant in the regression equation needed?
 3. Find a 95% confidence interval for the slope.
7.3.1 Two groups

When the explanatory variable takes only two values (e.g., male/female), we use a two-sample t-test and associated methods. The interpretation is similar to the paired t-test used in the previous section.
 • The p-value gives the probability of the group means being as different as was observed if there was no real difference between the groups.
 • The 95% confidence interval contains the true difference between the means of the two groups with probability 0.95.

 Example: Stock exchange volatility again

 Data: returns for 30 stocks listed on each of NASDAQ and NYSE for 9–13 May 1994. We look at the absolute return in stock prices. This is a measure of volatility. For example, a market where stocks average a weekly 10% change in price (positive or negative) is more volatile than one which averages a 5% change.

 Numerical summaries:
            NASDAQ     NYSE
 Min.      0.00380  0.00260
 1st Qu.   0.01745  0.01120
 Median    0.03930  0.02480
 Mean      0.04395  0.02913
 3rd Qu.   0.05575  0.04010
 Max.      0.12240  0.08910

 Analysis of Variance Table
 Response: absreturn
            Df    Sum Sq   Mean Sq  F value   Pr(>F)
 exchange    1  0.003293  0.003293   4.0405  0.04908 *
 Residuals  58  0.047270  0.000815
 ---
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Confidence interval for difference of means: [0.00001, 0.00296]

 Conclusion: There is some evidence (but not very strong evidence) that the NASDAQ is more volatile than the NYSE.
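The earlier claim that, for two groups, the ANOVA F-test and the pooled-variance t-test give the same p-value can be verified numerically. The data below are synthetic stand-ins for the two sets of absolute returns, not the original figures.

```python
import numpy as np
from scipy import stats

# Two synthetic 'volatility' samples (illustrative only).
rng = np.random.default_rng(42)
a = np.abs(rng.normal(0.04, 0.03, size=30))
b = np.abs(rng.normal(0.03, 0.02, size=30))

t, p_t = stats.ttest_ind(a, b)   # pooled-variance t-test (scipy's default)
F, p_F = stats.f_oneway(a, b)    # one-way ANOVA with two groups

# For two groups, F = t**2 and the two p-values coincide.
```

This identity only holds for the pooled-variance form of the t-test; Welch's unequal-variance version will generally differ from the ANOVA result.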
7.4 Post-hoc tests

When we have a categorical explanatory variable with more than two categories, it is natural to ask which categories differ from each other. For example, if the variable is “Day of the week”, are all days different from each other, or are weekends different from weekdays, or something more complicated?

The inclusion or otherwise of the variable is determined by an F-test or an adjusted R2 value. But testing differences between levels of a category requires a post-hoc test.

7.5 SPSS exercises

7.5.1 Call centre patterns

We will use data on the number of calls to a Melbourne call centre. Download the data from
    http://www.robhyndman.info/downloads/Calls.xls
and save it to your disk.

The variable Calls gives the total number of calls each day. The variable Trend gives the smooth trend through the data, eliminating the effect of daily fluctuations.
 1. Produce a time plot of the data over time with the trend on the same graph. Can you explain the fluctuations in the trend?
 2. Calculate the percentage deviation from the trend for each day.
 3. Compute summary statistics and boxplots for the deviations for each day.
 4. Use an ANOVA test to check whether the percentage deviations for each day are significantly different from each other.
 5. Which days are significantly different?
CHAPTER 8 Dimension reduction

8.1 Factor analysis

Factor analysis is most useful as a way of combining many numerical explanatory variables into a smaller number of numerical explanatory variables.

Basic idea:
 • You try to uncover some underlying, but unobservable, quantities called “factors”. Each variable is assumed to be (approximately) a linear combination of these factors.
 • For example, if there are two factors called F1 and F2, then the ith observed variable Xi can be written as Xi = b0 + b1 F1 + b2 F2 + error. The coefficients b0, b1 and b2 differ for each of the observed variables.
 • The factors are assumed to be independent of each other.
 • The factors are chosen so they explain as much of the variation in the observed variables as possible.
 • The factor loadings are the values of b1 and b2 in the above equation.
 • Principal components analysis is the usual method for estimating the factors.
 • The estimated factors (or scores) can be used as explanatory variables in subsequent regression models.
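The basic idea can be sketched with scikit-learn. Two caveats: scikit-learn's FactorAnalysis estimates the factors by maximum likelihood rather than the principal-components method mentioned above, and the data here are synthetic with made-up loadings, purely for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Six observed variables driven by two hidden factors plus noise.
rng = np.random.default_rng(0)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)
noise = lambda: 0.3 * rng.normal(size=n)
X = np.column_stack([
    1.0 * f1 + noise(),            # loads on factor 1
    0.9 * f1 + noise(),            # loads on factor 1
    1.0 * f2 + noise(),            # loads on factor 2
    0.8 * f2 + noise(),            # loads on factor 2
    0.7 * f1 + 0.7 * f2 + noise(), # loads on both factors
    rng.normal(size=n),            # mostly unique noise
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_.T   # one row per variable, one column per factor
scores = fa.transform(X)      # estimated factor scores, one row per observation
```

The `loadings` array plays the role of the "Loadings" tables in the examples that follow, and `scores` gives the per-observation factor values that could be carried into a later regression.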
Example: National track records

The data on national track records for men are listed in the following table.

 Country     100m   200m   400m  800m  1500m  5000m  10000m  Marathon
              (s)    (s)    (s) (min)  (min)  (min)   (min)     (min)
 Argentina  10.39  20.81  46.84  1.81   3.70  14.04   29.36    137.72
 Australia  10.31  20.06  44.84  1.74   3.57  13.28   27.66    128.30
 Austria    10.44  20.81  46.82  1.79   3.60  13.26   27.72    135.90
 Belgium    10.34  20.68  45.04  1.73   3.60  13.22   27.45    129.95
 Bermuda    10.28  20.58  45.91  1.80   3.75  14.68   30.55    146.62
 Brazil     10.22  20.43  45.21  1.73   3.66  13.62   28.62    133.13
 ...
 Turkey     10.71  21.43  47.60  1.79   3.67  13.56   28.58    131.50
 USA         9.93  19.75  43.86  1.73   3.53  13.20   27.43    128.22
 USSR       10.07  20.00  44.60  1.75   3.59  13.20   27.53    130.55
 W.Samoa    10.82  21.86  49.00  2.02   4.24  16.28   34.71    161.83

[Figure 8.1: Scatterplot matrix of national track record data. All data in average metres per second.]

Correlation matrix:
           100m  200m  400m  800m  1500m  5000m  10000m  Marathon
 100m      1.00  0.92  0.83  0.75   0.69   0.60    0.61      0.50
 200m      0.92  1.00  0.85  0.80   0.77   0.69    0.69      0.59
 400m      0.83  0.85  1.00  0.87   0.83   0.77    0.78      0.70
 800m      0.75  0.80  0.87  1.00   0.91   0.85    0.86      0.80
 1500m     0.69  0.77  0.83  0.91   1.00   0.93    0.93      0.86
 5000m     0.60  0.69  0.77  0.85   0.93   1.00    0.97      0.93
 10000m    0.61  0.69  0.78  0.86   0.93   0.97    1.00      0.94
 Marathon  0.50  0.59  0.70  0.80   0.86   0.93    0.94      1.00

Loadings:
          Factor1  Factor2
X100m       0.275    0.918
X200m       0.379    0.886
X400m       0.546    0.736
X800m       0.684    0.623
X1500m      0.799    0.527
X5000m      0.904    0.382
X10000m     0.911    0.387
Marathon    0.914    0.271

                Factor1  Factor2
SS loadings       4.108    3.205
Proportion Var    0.513    0.401
Cumulative Var    0.513    0.914

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 15.49 on 13 degrees of freedom.
The p-value is 0.278

Factor 1 seems to be mostly a measure of long-distance events. Factor 2 is mostly a measure of short-distance events.

[Figure 8.2: Scatter plot of the two estimated factors. The point for Cook Islands is labelled.]
Example: Stock price data

Stock price data for 100 weekly rates of return on five stocks are listed below. The data were collected for January 1975 through December 1976. The weekly rates of return are defined as

    Rate of return = (current Friday closing price − previous Friday closing price) / previous Friday closing price,

adjusted for stock splits and dividends.

 Week  Allied Chemical   Du Pont  Union Carbide     Exxon     Texaco
    1         0.000000  0.000000       0.000000  0.039473  -0.000000
    2         0.027027 -0.044855      -0.003030 -0.014466   0.043478
    3         0.122807  0.060773       0.088146  0.086238   0.078124
    4         0.057031  0.029948       0.066808  0.013513   0.019512
    5         0.063670 -0.003793      -0.039788 -0.018644  -0.024154
  ...
   99         0.050167  0.036380       0.004082 -0.011961   0.009216
  100         0.019108 -0.033303       0.008362  0.033898   0.004566

                  Allied Chemical  Du Pont  Union Carbide  Exxon  Texaco
 Allied Chemical             1.00     0.58           0.51   0.39    0.46
 Du Pont                     0.58     1.00           0.60   0.39    0.32
 Union Carbide               0.51     0.60           1.00   0.44    0.43
 Exxon                       0.39     0.39           0.44   1.00    0.52
 Texaco                      0.46     0.32           0.43   0.52    1.00

Loadings:
                  Factor1  Factor2
 Allied Chemical    0.683    0.192
 Du Pont            0.692    0.519
 Union Carbide      0.680    0.251
 Exxon              0.621
 Texaco             0.794   -0.439

                Factor1  Factor2
 SS loadings      2.424    0.567
 Proportion Var   0.485    0.113
 Cumulative Var   0.485    0.598

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 0.58 on 1 degree of freedom.
The p-value is 0.448

Factor 1 is almost equally weighted across the stocks and therefore indicates an overall measure of market activity. Factor 2 represents a contrast between the chemical stocks (Allied Chemical, Du Pont and Union Carbide) and the oil stocks (Exxon and Texaco). Thus it measures an industry-specific difference.
[Figure 8.3: Time series of the five stocks (Allied Chemical, Du Pont, Union Carbide, Exxon, Texaco) between January 1975 and December 1976.]
[Figure 8.4: Scatterplots of the five stocks.]
8.2 Further reading

8.2.1 Factor analysis
 • J.F. HAIR, R.E. ANDERSON, R.L. TATHAM and W.C. BLACK (1998). Multivariate data analysis, 5th edition, Prentice Hall.
CHAPTER 9 Data analysis with a categorical response variable

9.1 Chi-squared test

 Example: What's your excuse?

 The following results are from a survey of students' excuses for not sitting exams.

                    United States  France  Britain
 Dead grandparent             158      22      220
 Car problem                  187      90       45
 Animal trauma                 12     239        8
 Crime victim                  65       4      125

 Do different nationalities have different excuses?

A two-way table consists of frequencies of observations split up by two categorical variables. Each combination of values for the two variables defines a cell. The question of interest is: is there a relation between the two variables?

H0: there is no association between the two variables.
H1: the two variables are not independent.

We use a χ2 test:
    X2 = Σ (O − E)2 / E,
where O = observed counts and E = expected counts (assuming independence).
 • Large values of X2 provide evidence against H0.
 • Small p-values provide evidence against H0.
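The whole calculation, expected counts included, can be reproduced from the observed table. A sketch assuming scipy is available:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the excuses example
# (rows: excuse type, columns: US, France, Britain).
observed = np.array([
    [158,  22, 220],   # dead grandparent
    [187,  90,  45],   # car problem
    [ 12, 239,   8],   # animal trauma
    [ 65,   4, 125],   # crime victim
])

# chi2_contingency computes the expected counts under independence,
# the X^2 statistic, its degrees of freedom (r-1)(c-1), and the p-value.
chi2, p, df, expected = chi2_contingency(observed)
```

The statistic agrees with the worked ChiSq of 795.146 on 6 degrees of freedom shown on the next page, and the expected count in the first cell matches the printed 143.66.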
Notes on chi-square tests
 • The results are only valid if the cell counts are large. For 2 × 2 tables: all expected cell counts ≥ 5. Where this is not true, use Fisher's exact test.
 • The test is always two-sided.
 • If there are too few values in a cell, try combining rows or columns.

 Example: What's your excuse?

 Expected counts are printed below observed counts

            US  France  Britain  Total
 1         158      22      220    400
        143.66  120.85   135.49
 2         187      90       45    322
        115.65   97.29   109.07
 3          12     239        8    259
         93.02   78.25    87.73
 4          65       4      125    194
         69.67   58.61    65.71
 Total     422     355      398   1175

 ChiSq =  1.431 +  80.856 + 52.713 +
         44.026 +   0.546 + 37.635 +
         70.568 + 330.222 + 72.459 +
          0.314 +  50.886 + 53.491 = 795.146
 df = 6, p = 0.000
 Example: Snoozing

 How often do you press the snooze button in the morning?

           Bentley College  Babson College  Total
 Once                   22               1     23
 Twice                  18              12     30
 3 times                32              25     57
 4 times                11              22     33
 5 times                 5              15     20
 6+ times               12              25     37
 Total                 100             100    200

 Expected counts are printed below observed counts

          C1     C2  Total
 1        22      1     23
       11.50  11.50
 2        18     12     30
       15.00  15.00
 3        32     25     57
       28.50  28.50
 4        11     22     33
       16.50  16.50
 5         5     15     20
       10.00  10.00
 6        12     25     37
       18.50  18.50
 Total   100    100    200

 ChiSq = 9.587 + 9.587 + 0.600 + 0.600 +
         0.430 + 0.430 + 1.833 + 1.833 +
         2.500 + 2.500 + 2.284 + 2.284 = 34.468
 df = 5, p = 0.000

 So there is strong statistical evidence of an association between the variables. Hence we can conclude that students from the two colleges have different sleep patterns.
 Example: Survival and pet ownership

 Does having a pet help survival from coronary heart disease? A 1980 study investigated 92 CHD patients, who were classified according to whether they owned a pet and whether they survived for one year.

 Patient Status  Owned Pet  No Pet
 Alive                  50      28
 Dead                    3      11

 Expected counts are printed below observed counts

         Had.pet  No.pet  Total
 Alive        50      28     78
           44.93   33.07
 Dead          3      11     14
            8.07    5.93
 Total        53      39     92

 ChiSq = 0.571 + 0.776 + 3.181 + 4.323 = 8.851
 df = 1, p = 0.003

 1. Calculate relevant proportions.
 2. What do you conclude?

9.2 Logistic and multinomial regression

Logistic regression is used when the response variable is categorical with two categories (e.g., Yes/No). The model allows the calculation of the probability of a “Yes” given the set of explanatory variables.

Multinomial regression is a regression model where the response variable is categorical with more than two categories.

Useful reference
 • KLEINBAUM, D.G., and KLEIN, M. (2002) Logistic regression: a self-learning text, 2nd ed, Springer-Verlag.
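A minimal logistic-regression sketch with scikit-learn. The data and variable names here are synthetic and made up for illustration; they are not taken from the examples above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic yes/no outcome driven by one numerical explanatory variable.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
true_prob = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))   # true P(yes | x)
y = rng.binomial(1, true_prob)

# Fit the model; the fitted coefficients are on the log-odds scale.
model = LogisticRegression().fit(x, y)
p_hat = model.predict_proba([[1.0]])[0, 1]   # estimated P(yes) at x = 1
```

The key output is a probability between 0 and 1 for each value of the explanatory variables, which is exactly what the section above says the model provides.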
9.3 SPSS exercises
 1. Repeat the examples in Section 9.1 using SPSS to find the p-values.
 2. In a study of health in Zambia, people were rated as having ‘good’, ‘fair’ or ‘poor’ health. Similarly, the economy of the village in which each person lived was rated as ‘poor’, ‘fair’ or ‘good’. For the 521 villagers assessed, the following data were observed.

                      Health
    Village   Good  Fair  Poor  Total
    Poor        62   103    68    233
    Fair        50    36    33    119
    Good        80    69    20    169
    Total      192   208   121    521

    (a) Find a 95% confidence interval for the proportion of poor villages in Zambia.
    (b) Use SPSS to carry out a chi-squared test for independence on these data.
    (c) Explain in one or two sentences how these data differ from what you would expect if health and village were independent.
    (d) Do these data show that economic prosperity causes better health? Explain in one or two sentences.
    (e) Consider now only people from poor villages. What proportion of these people have health that is rated less than good? Give a 95% confidence interval for this proportion.
    (f) An alternative approach to this problem would have been to measure health numerically for each person. What sort of analysis would have been most appropriate in that case?
CHAPTER 10 A survey of statistical methodology

I use a decision tree based on the type of response variable and the type of explanatory variable(s).

 Recall: Response and explanatory variables

 Response variable: measures the outcome of a study. Also called the dependent variable.
 Explanatory variable: attempts to explain the variation in the observed outcomes. Also called an independent variable.

 Many statistical problems can be thought of in terms of a response variable and one or more explanatory variables.

 • Study of level of stress-related leave amongst Australian small business employees.
   – Response variable: No. days of stress-related leave in a fixed period.
   – Explanatory variables: Age, gender, business type, job level.
 • Return on investment in Australian stocks.
   – Response variable: Return.
   – Explanatory variables: industry, risk profile of company, etc.
 Taxonomy of statistical methodology

 RESPONSE VARIABLE: Numerical

 EXPLANATORY VARIABLE:  None                 Numerical    Categorical
 Graphics               Boxplot, histogram   Scatterplot  Side-by-side boxplots
 Summary stats          Mean, percentiles,   Correlation  Mean, st.dev. by group;
                        IQR, st.dev.                      percentiles by group
 Methods                t-test, confidence   Regression   2-sample t-test (2 groups),
                        intervals                         one-way ANOVA
                        Regression, General Linear Model

 RESPONSE VARIABLE: Categorical

 EXPLANATORY VARIABLE:  None                 Numerical            Categorical
 Graphics               Bar chart            Side-by-side         Side-by-side bar charts
                                             boxplots
 Summary stats          Percentages,         Mean, st.dev. by     Percentages by group,
                        proportions          group                contingency tables
 Methods                Confidence           Logistic regression  Chi-square test
                        intervals
                        Logistic regression, Generalized Linear Model
Numerical response variable, no explanatory variables
 • Graphical summaries: boxplot, histogram.
 • Numerical summaries: percentiles, mean, median, standard deviation, etc.
 • Statistical methods: confidence intervals and t-test.

Categorical response variable, no explanatory variables
 • Graphical summaries: bar chart.
 • Numerical summaries: group frequencies/percentages.
 • Statistical methods: confidence interval for proportion.

Numerical response variable, one numerical explanatory variable
 • Graphical summaries: scatterplots.
 • Numerical summaries: correlation.
 • Statistical methods: regression.
 Examples: income and age; sales and advertising; number of accidents and length of time worked; turnover of company and number of employees.

Categorical response variable, one categorical explanatory variable
 • Graphical summaries: bar charts and variants.
 • Numerical summaries: cross tabulations.
 • Statistical methods: test for two proportions; χ2 test for independence.
 Examples: gender and voting preference; religion and education level; head office location and industry.
 (Note: response and explanatory variables reversible.)

Numerical response variable, one categorical explanatory variable
 • Graphical summaries: boxplots.
 • Numerical summaries: group means, standard deviations.
 • Statistical methods: t-tests; one-way ANOVA.
 Examples: strength of agreement to survey question and location; income and gender; stock return and use of hedging.
Categorical response variable, one numerical explanatory variable
 • Graphical summaries: boxplots.
 • Numerical summaries: group means, standard deviations.
 • Statistical methods: logistic regression.
 Examples: mortality and pollutant level; language spoken and length of time in Australia; bankruptcy and level of investment risk.

Numerical response variable, several explanatory variables
 Statistical methods:
 • Multiple regression (all numerical explanatory variables)
 • Multi-way ANOVA (all categorical explanatory variables)
 • General linear model

Categorical response variable, several explanatory variables
 Statistical methods:
 • Multi-way contingency table and χ2 test (all categorical explanatory variables)
 • Multiple logistic regression (all numerical explanatory variables)
 • Generalized linear model

Other methods

 Multivariate methods (e.g., factor analysis, principal components)
 • Used where there is more than one numerical response variable.
 • Used to reduce the number of explanatory variables.

 Time series (e.g., forecasting, spectral analysis)
 • Used where the response variable is observed over time and where time is treated like an explanatory variable.
Case 1: Lactobacillus counts

Lactobacillus counts were measured in 21 people with different degrees of susceptibility to dental caries.
 Group 1: Rampant caries (5+ new lesions in past year)
 Group 2: Normal caries (1–4 new lesions in past year)
 Group 3: Caries resistant (no lesions in the past year)

Lactobacillus counts (in thousands):
 Group 1   118  562  722  238  169  133  201
 Group 2   422  109  261  147  330   97
 Group 3   278  150   69  164   95  131  170   68

Response variable:
Explanatory variables:
Type of analysis appropriate:

[Side-by-side boxplots of Lactobacillus counts for the rampant caries, normal caries and caries resistant groups.]

Analysis of Variance on Lactobil
Source   DF      SS     MS     F      p
Caries    2  102691  51345  2.02  0.162
Error    18  458242  25458
Total    20  560933

                        Individual 95% CIs For Mean Based on Pooled StDev
Level  N   Mean  StDev
1      7  306.1  237.4     (----------*---------)
2      6  227.7  131.9    (----------*----------)
3      8  140.6   68.6  (---------*---------)
Pooled StDev = 159.6       120       240       360
Case 2: Charlie's chooks

157 chicken farmers are worried about bird mortality. There are two types of birds: Tegel (Australian) and imported.

Response variable:
Explanatory variables:
Type of analysis appropriate:

[Scatterplot: percentage mortality (Y) against percentage Tegel birds (X).]

Regression analysis: Mortality and Tegel.pc
Coefficients:
              Value  Std. Error  t value  Pr(>|t|)
(Intercept)  4.2986      0.7063   6.0864    0.0000
tegelpc      0.0168      0.0079   2.1276    0.0350

Residual standard error: 1.852 on 155 degrees of freedom
Multiple R-Squared: 0.02838
F-statistic: 4.527 on 1 and 155 degrees of freedom, the p-value is 0.03495

[Scatterplot: percentage mortality (Y) against percentage Tegel birds (X), with the fitted regression line.]
Example 1: Stress-related leave
What factors contribute to stress-related leave amongst Australian small business employees?
Response variable:
Explanatory variables:
Type of analysis appropriate:

Example 2: Return on investment
What types of Australian companies give the greatest return on investment?
Response variable:
Explanatory variables:
Type of analysis appropriate:

Example 3: English language proficiency
Large study of migrants from many different non-English speaking backgrounds. We want to know what variables affect ability to speak English well.
Response variable:
Explanatory variables:
Type of analysis appropriate:
CHAPTER 11 Further methods

11.1 Classification and regression trees

A method for predicting a response variable given a set of explanatory variables, using a “tree” constructed from binary splits of the explanatory variables. The response variable can be categorical or numerical, and so can the explanatory variables.

Example: completions of research students at Monash University

The data were provided by the Monash Research Graduate School (MRGS). There were 445 HDR students in the data set: 200 from the 1993 cohort and 245 from the 1994 cohort. These were all the students who had received an Australian Postgraduate Award (APA), a Monash Graduate Scholarship (MGS) or some other centrally-administered postgraduate scholarship.

The completion status of these students in July 2001 fell into two categories:
 1. Completed, for students having their theses passed and degrees awarded, or whose theses were submitted but still under examination;
 2. Not completed, for all other students, including students continuing to study in July 2001, students who had discontinued candidature, and those whose theses were submitted but not passed.

The following ten variables were considered for their potential effect on students' completion status.
 1. Faculty: the faculty that each student enrolled in. There are ten faculties.
 2. Course: PhD or Masters.
 3. Age: the age in years of each student on the date of enrolment.
 4. Gender: male or female.
 5. Lab-based: whether the research of a student is laboratory-based or not.
 6. International student: whether a student is an international or a local student.
 7. First language: whether a student's first language is English or not.
 8. Admission degree: Honours, Masters or pass.
 9. Admission qualification: the honours level of each student's admission qualification. This will be first class honours (H1), second class honours level A (H2A) or second class honours level B (H2B). Note that Masters candidates were classified as either H1, H2A or H2B equivalent.
 10. Publications: whether a student had any publications when admitted to enter the course.

The tree is constructed by recursively partitioning the data into two groups at each step. The variable used for partitioning the data at each step is selected to maximize the differences between the two groups. The splitting process is repeated until the groups are too small to be split into significantly different groups.

[Figure 11.1: Classification tree for completion status, built on 428 students. The root shows an overall completion probability of 0.66, and the first split is on Faculty.]

Figure 11.1 shows a classification and regression tree for the completion rate. Only six of the ten variables were significant and used in the tree construction. These were:
 • faculty
 • age on enrolment
 • admission qualification
 • admission degree
 • international student
 • prior publication

On this tree, the variable used to split the data set is shown at each node.
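The recursive-partitioning idea can be sketched with scikit-learn's DecisionTreeClassifier. This is a toy stand-in for the original analysis (which was done with different software): the data below are synthetic, and only two of the ten variables are imitated.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical completion data: age and a publication indicator.
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(20, 40, size=n)
publication = rng.integers(0, 2, size=n)
# Completion is made more likely for younger students with a publication.
prob = np.clip(0.4 + 0.3 * publication - 0.01 * (age - 20), 0, 1)
completed = rng.binomial(1, prob)

X = np.column_stack([age, publication])
# min_samples_leaf mimics "groups too small to split"; max_depth keeps it readable.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50).fit(X, completed)
print(export_text(tree, feature_names=["age", "publication"]))
```

Each internal node of the printed tree shows the binary split chosen at that step, just as each node of Figure 11.1 shows the splitting variable.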
At each leaf or terminal node, the class split at the upper node is displayed with its completion probability.

Note that although there were 445 students in the data set, 11 students had their admission qualifications missing, 2 students had their admission degrees missing, and a further 4 students had other data missing. These students were all removed from the data to be able to
estimate the tree structure. So there are 428 observations used in the tree.

The following conclusions can be drawn from this analysis:
 • The most important variable is Faculty, with BusEco, IT, Law, Medicine, Pharmacy and Science students having a higher completion probability than students from other faculties. Arts, in particular, had a lower completion probability than the other faculties.
 • For Arts students, the next most important variable was age, with young students (enrolment age less than 22 years) having a much higher completion rate than older students. Among the older Arts students, international students performed better.
 • For students from BusEco, IT, Law, Medicine, Pharmacy and Science, the situation is more complex. Students with a publication had a higher completion rate, especially if they also had a Masters degree. Students without a publication did well if they had an H2A entry rather than an H1 entry. For students with no publication and an H1 entry, the older students (enrolment age greater than 23 years) did the worst.
 • The groups having worst performance are:
   – Arts students, with completion probability of 0.47. Of these, students aged 22 or more on enrolment had completion probability of 0.45 (and only 0.41 for non-international students).
 • The groups having best performance are:
   – BusEco, IT, Law, Medicine, Pharmacy and Science students with a publication, who had completion probability of 0.88 (and 100% for Masters students with a publication).
   – Law, Pharmacy and Medicine students without a publication and over 23 on enrolment, who had a completion probability greater than 0.9.

Further reading
 • BREIMAN, L., FRIEDMAN, J., OLSHEN, R., and STONE, C. (1984) Classification and regression trees, Wadsworth & Cole: Belmont, CA.

11.2 Structural equation modelling

Sets of linear equations used to specify phenomena in terms of presumed cause-and-effect variables.
Some of the variables can be unobserved.

Further reading
• Schumacker, R.E., and Lomax, R.G. (1996) A beginner's guide to structural equation modeling, Hillsdale, N.J.: Lawrence Erlbaum Associates.
• Kline, R.B. (1998) Principles and practice of structural equation modeling, Guilford: New York.
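The flavour of such a system can be shown with a small simulation. This is only a sketch with invented variable names and coefficients, not the output of an SEM package: an unobserved variable drives two observed indicators and an outcome, and a naive regression on one error-laden indicator understates the causal path, which is the measurement problem SEM addresses by modelling the error explicitly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

# Hypothetical structural system with an unobserved (latent) variable:
#   latent -> x1, x2  (observed indicators, measured with error)
#   latent -> y       (outcome; true path coefficient 0.5)
latent = rng.normal(size=n)
x1 = latent + rng.normal(scale=1.0, size=n)
x2 = 0.8 * latent + rng.normal(scale=1.0, size=n)
y = 0.5 * latent + rng.normal(scale=0.5, size=n)

# Naive regression of y on x1 is attenuated by the measurement error:
naive = np.cov(y, x1)[0, 1] / np.var(x1)

# Using the second indicator to adjust for that error (the kind of
# correction an SEM makes) recovers a value close to the true path:
corrected = np.cov(y, x2)[0, 1] / np.cov(x1, x2)[0, 1]

print(round(naive, 2), round(corrected, 2))
```

With these settings the naive slope comes out near 0.25 while the corrected estimate is close to the true 0.5, illustrating why the unobserved variable must be modelled rather than ignored.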
11.3 Time series models
These are models of time series data and are usually designed for forecasting. The most common models are:
• exponential smoothing;
• ARIMA (or Box-Jenkins) models;
• VAR models (for modelling several time series simultaneously).

Further reading
• Makridakis, S., Wheelwright, S., and Hyndman, R.J. (1998) Forecasting: methods and applications, John Wiley & Sons: New York. Chapters 4 and 7.

11.4 Rank-based methods
Replacements for t-tests and ANOVA when the data are not normally distributed.

Further reading
• Gibbons, J.D., and Chakraborti, S. (2003) Nonparametric statistical inference, CRC Press.
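As a sketch with made-up sample values, the rank-based replacements are readily available in scipy: the Mann-Whitney (Wilcoxon rank-sum) test stands in for the two-sample t-test, and the Kruskal-Wallis test for one-way ANOVA.

```python
from scipy import stats

# Hypothetical task-completion times (minutes) for two training groups;
# small, skewed samples where normality is doubtful
group_a = [1.1, 2.3, 1.9, 2.8, 1.5]
group_b = [4.2, 3.9, 5.1, 4.8, 6.0]

# Rank-based replacement for the two-sample t-test
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Rank-based replacement for one-way ANOVA (here with just two groups)
h_stat, p_kw = stats.kruskal(group_a, group_b)

print(p_mw, p_kw)
```

Because these tests use only the ranks of the observations, they are insensitive to skewness and outliers that would distort a t-test or ANOVA on the raw values.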
CHAPTER 12
Presenting quantitative research

12.1 Numerical tables
• Give only as many decimal places as are accurate, meaningful and useful.
• Make sure decimal points are aligned.
• Use horizontal or vertical lines to help the reader make the desired comparisons.
• Avoid giving all grid lines.
• Give meaningful column and row headings.
• Give a detailed caption.
• A table should be as self-explanatory as possible.
• Do readers really need to know all the values?
• Replace large tables with graphs where possible.
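The first of these guidelines can be enforced at the point where the table is generated. A small pandas sketch (the branch names and survey figures are invented for illustration):

```python
import pandas as pd

# Hypothetical summary of satisfaction scores by branch
summary = pd.DataFrame(
    {"mean": [3.74219, 4.10533, 3.98812],
     "sd": [0.81234, 0.77219, 0.90112],
     "n": [112, 98, 127]},
    index=["Branch A", "Branch B", "Branch C"],
)

# Report only as many decimal places as are meaningful;
# counts are left as integers
reported = summary.round({"mean": 2, "sd": 2})
print(reported)
```

Two decimals are plenty for a five-point satisfaction scale; the extra digits in the raw output convey spurious precision, not information.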
12.2 Graphics

12.2.1 The purpose of data-based graphics
    Data graphics are paragraphs about data.
        E.R. Tufte
Communication: Graphs provide a visual summary of data. The intention is usually to display discovered patterns in the data.
Analysis: Graphs reveal facts about data that are difficult to detect from tables. The intention is usually to discover patterns in the data.
There was initial resistance to the use of graphics for understanding data. In the 1800s, the Royal Society requested that the automatically recorded graphs of an early weather clock be "reduce[d] into writing . . . that thereby the Society might have a specimen of the weather-clock's performances before they proceed to the repairing of it."

12.2.2 Produce meaningful graphs
A graph is designed to represent data using symbols such as the length of a line, or the distance from an axis. The graphical elements should be proportional to the data. Some people are tempted to over-decorate a graph and forget about its purpose: to show the data.
• Keep graphical elements proportional to the data.
• Don't let decoration destroy content.
• Make graphs informative, not just impressive.

12.2.3 Avoid design variation
The only element of a graph that should change is the element proportional to the data. If other elements also change with the data, the meaning of the graph is easily lost.
• Keep the scales constant.
• Keep non-data features constant.
• Don't use perspective, as it is difficult to distinguish changes in the data from changes due to perspective.
• Don't plot one-dimensional data with two- or three-dimensional symbols.

12.2.4 Avoid scale distortion
Some data need to be scaled before any meaningful comparisons can be made. For example, monetary values change due to inflation, monthly sales change due to the number of trading days in a month, and government income from taxes changes with the population.
• Scale for population changes
• Scale for inflation
• Scale for different time periods

12.2.5 Avoid axis distortion
By changing an axis, a graph can be made to look almost flat or wildly fluctuating. Such extremes should be avoided, and those looking at a graph should also look at the scales.
Some graphs mislead by showing a small section of the data rather than the whole context in which the data lie. A decrease in sales over a few months may be part of a long decreasing trend, or a momentary drop in a long increasing trend.
We are accustomed to interpreting graphs with the dependent variable (e.g. sales) on the vertical axis and the independent variable (e.g. time) on the horizontal axis. Flouting this convention can be misleading.
Some graphs attempt to show data over a wide range by using a broken axis. This too can be misleading. If the data range is too wide for the graph, the data are probably on the wrong scale.
• Look carefully at the axis scales
• Show context
• Keep the dependent variable on the vertical axis
• Avoid broken axes

12.2.6 Graphical excellence
Graphical displays should:
• show the data
• induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
• make large data sets coherent
• present many numbers in a small space
• encourage the eye to compare different pieces of data
• reveal the data at several levels of detail, from a broad overview to the fine structure
• serve a reasonably clear purpose: description, exploration, tabulation, or decoration
• be closely integrated with the statistical and verbal descriptions of a data set.
Graphical integrity is more likely to follow if these six principles are followed:
• The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.
• Clear, detailed and thorough labelling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
• Show data variation, not design variation.
• In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
• The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
• Graphics must not quote data out of context.
We choose to use graphs for a very good reason. Human beings are well equipped to recognise and process visual patterns. Much of the processing power of the human brain is dedicated to visual information. Of all our senses, vision is the most dominant. When data are presented graphically, we can see the picture much more quickly than we would from the numbers themselves. We also tend to see subtleties that would be invisible if the data were presented in a table.
Good graphs are a very effective way to present quantitative information. Bad graphs can be misleading or, at best, useless.
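The principle about deflated monetary units can be sketched numerically. The sales and price-index figures below are invented for illustration:

```python
# Hypothetical annual sales ($m, nominal) and a consumer price index
years = [2004, 2005, 2006, 2007, 2008]
nominal = [100.0, 104.0, 108.0, 112.0, 116.0]
cpi = [100.0, 103.0, 106.5, 110.0, 114.0]   # base year 2004 = 100

# Deflate to constant (2004) dollars before graphing:
real = [s * 100.0 / p for s, p in zip(nominal, cpi)]
print([round(r, 1) for r in real])
```

A plot of the nominal series suggests steady 16% growth over the period, while the deflated series shows real sales were nearly flat: the "growth" was mostly inflation.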
    Graphical competence demands three quite different skills: the substantive, statistical, and artistic. Yet now most graphical work, particularly at news publications, is under the direction of but a single expertise—the artistic. Allowing artist-illustrators to control the design and content of statistical graphics is almost like allowing typographers to control the content, style and editing of prose. Substantive and quantitative expertise must also participate in the design of data graphics, at least if statistical integrity and graphical sophistication are to be achieved.
        E.R. Tufte

12.2.7 Cleveland's paradigm
A graph encodes quantitative and categorical information using symbols, geometry and colour. Graphical perception is the visual decoding of the encoded information.
• The graph may be beautiful but a failure: the visual decoding has to work.
• To make graphs work, we need to understand graphical perception: what the eye sees and can decode well.
• We need a paradigm that integrates statistical graphics and visual perception.
Cleveland's paradigm has been developed from intuition about what works well in a graph, from theory and experiments in visual perception, and from experiments in graphical perception. There are three basic elements:
1. A specification and ordering of elementary graphical-perception tasks.
2. The role of distance in graphical perception.
3. The role of detection in graphical perception.
Cleveland identified ten properties of graphs that relate to the judgments we make to visually decode quantitative information:
• Angle
• Position along a common scale
• Area
• Position along identical, non-aligned scales
• Colour hue
• Colour saturation
• Slope
• Density (amount of black)
• Volume
• Length (distance)
Cleveland's conclusion is that there is an ordering in the accuracy with which we carry out these tasks. The order, from most accurate to least accurate, is:
1. Position along a common scale
2. Position along identical, non-aligned scales
3. Length
4. Angle and slope
5. Area
6. Volume
7. Colour hue, colour saturation, density
Some of the tasks are tied in the list; we don't have enough insight to determine which can be done more accurately.
This leads to the basic principle:
    Encode data on a graph so that the visual decoding involves tasks as high as possible in the ordering.
There are some qualifications:
• It's a guiding principle, not a rule to be slavishly followed.
• Detection and distance have to be taken into account; they may sometimes override the basic principle.
This paradigm implies the following, which is not a systematic list, but a number of examples of the insights that follow from it.
1. Pie charts are not a good method of graphing proportions because they rely on comparing angles rather than distances. A better method is to plot proportions as a bar chart or dot chart. It is also easier to label a bar chart or dot chart than a pie chart.
2. Categorical data with a categorical explanatory variable are difficult to plot. A common approach is to use a stacked bar chart. The difficulty here is that we need to compare lengths rather than distances. A better approach is the side-by-side bar chart, which leads to distance comparisons. Ordering the groups can assist in making comparisons. However, side-by-side bar charts can become very cluttered with several grouping variables.
3. Time series should be plotted as lines with time on the horizontal axis. This enables distance comparisons, emphasises the ordering due to time, and allows several time series to be plotted on the same graph without visual clutter.
4. Avoid representing data using volumes.
5. If a key point is represented by a changing slope, consider plotting the rate of change itself rather than the original data.
6. Think of simplifications that enhance the detection of the basic properties of the data.
7. Think of how the distance between related representations of data affects their interpretation.

12.2.8 Aesthetics and technique in data graphical design
Thoughts from Tufte:
• Graphical elegance is often found in simplicity of design and complexity of data.
• Graphs should tend towards the horizontal, greater in length than height (about 50% wider than tall).
• There are many specific differences between friendly and unfriendly graphics:
• Friendly: words are spelled out; mysterious and elaborate encoding avoided. Unfriendly: abbreviations abound, requiring the viewer to sort through text to decode the abbreviations.
• Friendly: words run from left to right, the usual direction for reading occidental languages. Unfriendly: words run vertically, particularly along the Y-axis, or in several different directions.
• Friendly: little messages help explain the data. Unfriendly: the graphic is cryptic and requires repeated references to scattered text.
• Friendly: elaborately encoded shadings, cross-hatching and colours are avoided; instead, labels are placed on the graphic itself, and no legend is required. Unfriendly: obscure codings require going back and forth between legend and graphic.
• Friendly: the graphic attracts the viewer and provokes curiosity. Unfriendly: the graphic is repellent, filled with chartjunk.
• Friendly: colours, if used, are chosen so that the colour-deficient and colour-blind (5 to 10 percent of viewers) can make sense of the graphic (blue can be distinguished from other colours by most colour-deficient people). Unfriendly: the design is insensitive to colour-deficient viewers; red and green are used for essential contrasts.
• Friendly: type is clear, precise, modest. Unfriendly: type is clotted, overbearing.
• Friendly: type is upper-and-lower case, with serifs. Unfriendly: type is all capitals, sans serif.

References
• Cleveland, W.S. (1985) The elements of graphing data, Wadsworth.
• Cleveland, W.S. (1993) Visualizing data, Hobart Press.
• Tufte, E.R. (1983) The visual display of quantitative information, Graphics Press.
• Tufte, E.R. (1990) Envisioning information, Graphics Press.
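Cleveland's ordering can be applied directly when preparing data for a chart: plot proportions as a sorted bar or dot chart rather than a pie chart. A minimal sketch, with invented profit shares and hypothetical region labels:

```python
import pandas as pd

# Hypothetical percentage contributions to profit by region
share = pd.Series({"R1": 12, "R2": 9, "R3": 21, "R4": 5, "R5": 18, "R6": 35})

# For a bar or dot chart, order the categories by value so the eye
# compares positions along a common scale rather than angles:
ordered = share.sort_values(ascending=False)
print(list(ordered.index))

# ordered.plot.barh() would then draw the sorted horizontal bar chart
```

The decoding task becomes "position along a common scale", at the top of Cleveland's accuracy ordering, instead of the angle comparisons a pie chart demands.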
The Quality Magazine 7 #4 (1998), 64-68.

Good Graphs for Better Business
W. S. Cleveland & N. I. Fisher

Are we on track with our target for market share? Is our installation process capable of meeting the industry benchmark? What things are causing most of the unhappiness with our staff?
These are typical of the management problems that require timely and accurate information. They are also problems for which effective use of graphs can make a big difference.
The good news is that graphical capabilities are now readily available in statistical, spreadsheet, and word-processing packages. The bad news is that much of this graphical capability produces graphs that can hide or seriously distort crucial information contained in the data.
For a long time, how to make a good graph was largely a matter of opinion. However, the last 20 years have seen the development of a set of principles for sound graphical construction, based on solid scientific research and experimentation. Good entry points to these principles are provided in the References. In this article, we look at a couple of common examples of applying these principles.
It is helpful to think about the whole process of graphing, shown schematically in Figure 1. The crucial question is: How does the choice of graph affect the information as perceived by the recipient of the graph?
Figure 1. The graphical process involves extraction of information from data, a decision about which patterns are to be displayed, and then selection of a type of graph that will reveal this pattern to the user, without distortion. Stages: Data, Analysis and Interpretation, Information, Encoding, Decoding, Decoded Information, Business decision.
To see how graphs can conceal, or reveal, information, consider the humble pie chart.
Figure 2 shows data on the contributions to enterprise profits of a particular product in various regions R1, R2, … around Australia (labels modified from the original, but retaining the same ordering, which was alphabetical). What information is this supposed to be purveying? Certainly the caption doesn't enlighten us. If we wanted the actual percentage share for each region, we should simply use a table: tables are intended to provide precise numerical data, whereas the purpose of graphs is to reveal pattern.
Figure 2. Pie chart, showing the relative contributions to the profits of an enterprise from various Divisions around Australia.
For a more elaborate example, we turn to another popular graphical display, the divided bar chart for trend data. Figure 3 shows data on market share of whitegoods sales for an eighteen-month period, based on monthly industry surveys. What can we glean from this? Total sales aren't changing. Manufacturer 1 has the biggest market share. Is there nothing else?
Version 1.0, 28th May, 1998
The answer to this is provided by one of the fundamental tenets of graphing. Detection of pattern with this type of data is best done when each measurement is plotted as a distance from a common baseline. The baseline in Figure 4 is the left vertical axis, and we're simply comparing horizontal lengths that are vertically aligned. On the other hand, in the pie chart, we're trying to compare angles, and very small angles at that. This sort of comparison is known to be imprecise. Similar problems occur when the data are graphed in other colourful or picturesque ways, such as when their sizes are represented by 3-dimensional solid volumes.
However, we haven't finished with this data set yet. At least one more step should be taken: re-order the data so that they plot from largest to smallest. The final result is shown in Figure 5.
Figure 3. Divided bar chart, showing monthly sales of different brands of whitegoods over an 18-month period.
Returning to Figure 2, no obvious patterns emerge. So what's happened in the graphical process? Perhaps the Analysis and Interpretation step wasn't carried out. So, let's try a different plot that shows things more simply, a dotplot: see Figure 4.
Figure 4. A dotplot of the data used in Figure 2, showing the relative contributions to enterprise profits from its various Divisions around Australia. The discrete nature of the data is immediately evident.
A simple pattern emerges immediately: there are only five different levels of contribution; some rounding of the raw data has been performed. Why wasn't this evident in the pie chart?
Figure 5. The display from Figure 4 has been modified, so that the data plot from largest to smallest. A further pattern emerges: different States tend to contribute differently to enterprise profits.
A (potentially) more important pattern has now emerged: different States are not contributing equally to group profits. This information may be vital in helping management to identify a major improvement opportunity. We create one more graph to bring this out: see Figure 6.
What can we now say in defence of Figure 2 in the light of Figures 5 and 6? Really, only that it was produced from a spreadsheet at the press of a button. But judged by a standard of how well it conveys information, it has failed. This is typical of pie charts.
Now let's re-visit the market share data plotted in Figure 3. What aspects of this graph might be hindering or preventing us from seeing important pattern?
Figure 6. The contributions to group profit by different regions are plotted by State. The clear differences between States are evident.
We can get some idea of what's happening with the goods sold by Manufacturer 1: probably not much. We see this pattern because we're effectively comparing aligned (vertical) lengths: the 18 values of monthly sales for this Manufacturer are all measured up from a common baseline (the horizontal axis).
However, this isn't the case for any of the other variables. For example, the individual values for the second manufacturer are measured upwards from where the corresponding values for the first manufacturer stop: the lengths are not aligned. So the first step is to re-plot the data in order that, for each variable, we are comparing aligned lengths. See Figure 7.
Figure 7. The monthly sales data from Figure 3 have been replotted so that sales patterns for each manufacturer can be seen without distortion. However, the curve for the dominant manufacturer is compressing patterns in the other curves.
We have made some progress. Manufacturer 1 is more easily studied, and there is little evidence of anything other than random fluctuations around an average monthly sales volume of about 25,000 units. However, possible patterns in the other curves are difficult to detect because the scale of the axes is adjusted to allow all the data to be plotted on the same graph. The next step is to display the dominant curve separately, and to use a better aspect ratio when plotting the other variables. See Figure 8.
Figure 8. The monthly sales data from Figure 7 have been replotted so that the dominant curve is displayed separately, with a false origin, and the other curves that are measured on a much smaller scale can then be plotted using a better aspect ratio. This reveals more information about individual and comparative trends in the curves.
Now some very interesting patterns emerge. It is evident that sales for Manufacturer 2 have been in decline for most of this period. Not much is happening with Manufacturer 4, who is averaging about 2500 units. However, there is something happening with Manufacturers 3 and 5. From months 6 to 18, sales for Manufacturer 3 rose significantly and then declined. This appears to have been at the expense of Manufacturer 5, whose sales declined significantly and then recovered. If this comparison is really of interest, it should be studied separately. The difference is plotted in Figure 9.
Figure 9. This graph shows the difference between sales of Manufacturers 5 and 3. Over the period January–July 1997 there was a marked increase in sales in favour of Manufacturer 5; after July, this advantage declined steadily to the end of the year.
One plausible explanation would be a short but intense marketing campaign conducted during this period; there may be others. The main point is that appropriate graphs have elicited very interesting patterns in the data that may well be worthy of further exploration. The divided bar chart has done poorly; this is typically the case.
There is much to be learnt about constructing good graphs, to do with effective use of colour, choice of aspect ratio, displaying large volumes of data, use of a false origin, and so on. These issues are discussed at length in the references.
To summarise, what basic messages can we draw from these simple examples? There are several:
• The process of effective graph construction begins with simple analyses to see what sorts of patterns, that is, information, are present.
• For many graphs, pattern detection is far more acute when the data are measured from a common baseline, so that we are comparing aligned lengths.
• Re-arranging the values in a dot plot so that they are in decreasing order provides greatly enhanced pattern recognition.
• Graph construction is an iterative process.
• Sometimes more than one graph is needed to show all the interesting patterns in the best way.
• Two very commonly-used displays, pie charts and divided bar charts, typically do a poor job of revealing pattern.
Can you afford not to be using graphs in the way information is reported to you, or the way you are reporting it? How else can vital patterns be revealed and presented, and so provide effective input to decision-making at all levels of your enterprise?

References
Cleveland, William S. (1993), Visualizing Data. Hobart Press, Summit, New Jersey.
Cleveland, William S. (1994), The Elements of Graphing Data. Second edition. Hobart Press, Summit, New Jersey.
Tufte, Edward R. (1983), The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut.

Bill Cleveland is a member of the Statistics & Data Mining Research Department of Bell Labs, Murray Hill, NJ, USA.
Nick Fisher is a member of the Organisational Performance Measurement and Data Mining research groups in CSIRO.
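The re-plotting strategy of Figures 7-9 (separate series measured from a common baseline, then a comparison of direct interest plotted as its own series) can be sketched with pandas. The monthly figures here are invented, not the article's data:

```python
import pandas as pd

# Hypothetical monthly unit sales (thousands) for two manufacturers
sales = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "m3":    [5.0, 5.6, 6.4, 7.1, 6.2, 5.4],
    "m5":    [6.8, 6.1, 5.2, 4.6, 5.5, 6.3],
})

# Plot each manufacturer from a common baseline rather than stacking:
#   sales.plot(x="month", y=["m3", "m5"])
# Then study the comparison of direct interest as its own series:
sales["m5_minus_m3"] = sales["m5"] - sales["m3"]
print(sales[["month", "m5_minus_m3"]])
```

Stacked (divided) bars would force the reader to judge unaligned lengths; plotting each series, and the difference itself, turns the same question into aligned-position judgments.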
CHAPTER 13
Readings

24 July 2008: Claver and Quer (2001)
31 July 2008: Baugher, Varanelli and Weisbord (2000)
14 August 2008: Kanazawa and Kovar (2004)
21 August 2008: Yilmaz and Chatterjee (2003)
4 September 2008: Lanen (1999)
11 September 2008: Rozell, Pettijohn and Parker (2002)
18 September 2008: Lewis (2001)
