354 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375brief review only. The Delphi technique was de- invalid criteria (such as the status of an idea’sveloped during the 1950s by workers at the RAND proponent). Furthermore, with the iteration of theCorporation while involved on a U.S. Air Force questionnaire over a number of rounds, the indi-sponsored project. The aim of the project was the viduals are given the opportunity to change theirapplication of expert opinion to the selection – from opinions and judgments without fear of losing face inthe point of view of a Soviet strategic planner – of the eyes of the (anonymous) others in the group.an optimal U.S. industrial target system, with a Between each questionnaire iteration, controlledcorresponding estimation of the number of atomic feedback is provided through which the group mem-bombs required to reduce munitions output by a bers are informed of the opinions of their anonymousprescribed amount. More generally, the technique is colleagues. Often feedback is presented as a simpleseen as a procedure to ‘‘obtain the most reliable statistical summary of the group response, usuallyconsensus of opinion of a group of experts . . . by a comprising a mean or median value, such as theseries of intensive questionnaires interspersed with average ‘group’ estimate of the date by when ancontrolled opinion feedback’’ (Dalkey & Helmer, event is forecast to occur. Occasionally, additional1963, p. 458). In particular, the structure of the information may also be provided, such as argumentstechnique is intended to allow access to the positive from individuals whose judgments fall outside cer-attributes of interacting groups (knowledge from a tain pre-speciﬁed limits. In this manner, feedbackvariety of sources, creative synthesis, etc.), while comprises the opinions and judgments of all grouppre-empting their negative aspects (attributable to members and not just the most vocal. At the end ofsocial, personal and political conﬂicts, etc.). From a the polling of participants (i.e., after several roundspractical perspective, the method allows input from a of questionnaire iteration), the group judgment islarger number of participants than could feasibly be taken as the statistical average (mean / median) of theincluded in a group or committee meeting, and from panellists’ estimates on the ﬁnal round. The ﬁnalmembers who are geographically dispersed. judgment may thus be seen as an equal weighting of Delphi is not a procedure intended to challenge the members of a staticized group.statistical or model-based procedures, against which The above four characteristics are necessary deﬁn-human judgment is generally shown to be inferior: it ing attributes of a Delphi procedure, although thereis intended for use in judgment and forecasting are numerous ways in which they may be applied.situations in which pure model-based statistical The ﬁrst round of the classical Delphi proceduremethods are not practical or possible because of the (Martino, 1983) is unstructured, allowing the in-lack of appropriate historical / economic / technical dividual experts relatively free scope to identify, anddata, and thus where some form of human judg- elaborate on, those issues they see as important.mental input is necessary (e.g., Wright, Lawrence & These individual factors are then consolidated into aCollopy, 1996). Such input needs to be used as single set by the monitor team, who produce aefﬁciently as possible, and for this purpose the structured questionnaire from which the views, opin-Delphi technique might serve a role. ions and judgments of the Delphi panellists may be Four key features may be regarded as necessary elicited in a quantitative manner on subsequentfor deﬁning a procedure as a ‘Delphi’. These are: rounds. After each of these rounds, responses areanonymity, iteration, controlled feedback, and the analysed and statistically summarised (usually intostatistical aggregation of group response. Anonymity medians plus upper and lower quartiles), which areis achieved through the use of questionnaires. By then presented to the panellists for further considera-allowing the individual group members the oppor- tion. Hence, from the third round onwards, panelliststunity to express their opinions and judgments are given the opportunity to alter prior estimates onprivately, undue social pressures – as from dominant the basis of the provided feedback. Furthermore, ifor dogmatic individuals, or from a majority – should panellists’ assessments fall outside the upper orbe avoided. Ideally, this should allow the individual lower quartiles they may be asked to give reasonsgroup members to consider each idea on the basis of why they believe their selections are correct againstmerit alone, rather than on the basis of potentially the majority opinion. This procedure continues until
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 355a certain stability in panellists’ responses is achieved. of controlled experimentation or academic inves-The forecast or assessment for each item in the tigation that has taken place . . . (and that) asidequestionnaire is typically represented by the median from some RAND studies by Dalkey, most ‘evalua-on the ﬁnal round. tions’ of the technique have been secondary efforts An important point to note here is that variations associated with some application which was thefrom the above Delphi ideal do exist (Linstone, primary interest’’ (p. 11). Indeed, the relative paucity1978; Martino, 1983). Most commonly, round one is of evaluation studies noted by Linstone and Turoff isstructured in order to make the application of the still evident. Although a number of empirical exami-procedure simpler for the monitor team and panel- nations have subsequently been conducted, the bulklists; the number of rounds is variable, though of Delphi references still concern applications ratherseldom goes beyond one or two iterations (during than evaluations. That is, there appears to be awhich time most change in panellists’ responses widespread assumption that Delphi is a useful instru-generally occurs); and often, panellists may be asked ment that may be used to measure some kind of truthfor just a single statistic – such as the date by when and it is mainly represented in the literature in thisan event has a 50% likelihood of occurring – rather vein. Consideration of what the evaluation studiesthan for multiple ﬁgures or dates representing de- actually tell us about Delphi is the focus of thisgrees of conﬁdence or likelihood (e.g., the 10% and paper.90% likelihood dates), or for written justiﬁcations ofextreme opinions or judgments. These simpliﬁcationsare particularly common in laboratory studies and 4. Evaluative studies of Delphihave important consequences for the generalisabilityof research ﬁndings. We attempted to gather together details of all published (English-language) studies involving evaluation of the Delphi technique. There were a3. The study of Delphi number of types of studies that we decided not to include in our analysis. Unpublished PhD theses, Since the 1950s, use of Delphi has spread from its technical reports (e.g., of the RAND Corporation),origins in the defence community in the U.S.A. to a and conference papers were excluded because theirwide variety of areas in numerous countries. Its quality is less assured than peer-reviewed journalapplications have extended from the prediction of articles and (arguably) book chapters. It may also belong-range trends in science and technology to argued that if the studies reported in these excludedapplications in policy formation and decision mak- formats were signiﬁcant and of sufﬁcient quality thening. An examination of recent literature, for example, they would have appeared in conventional publishedreveals how widespread is the use of Delphi, with form.applications in areas as diverse as the health care We searched through nine computer databases:industry (Hudak, Brooke, Finstuen & Riley, 1993), ABI Inform Global, Applied Science and Technolo-marketing (Lunsford & Fussell, 1993), education gy Index, ERIC, Transport, Econlit, General Science(Olshfski & Joseph, 1991), information systems Index, INSPEC, Socioﬁle, and Psychlit. As search(Neiderman, Brancheau & Wetherbe, 1991), and terms we used: ‘Delphi and Evaluation’, ‘Delphi andtransportation and engineering (Saito & Sinha, Experiment’, and ‘Delphi and Accuracy’. Because1991). our interest is in experimental evaluations of Delphi, Linstone and Turoff (1975) characterised the we excluded any ‘hits’ that simply reported Delphigrowth of interest in Delphi as from non-proﬁtmak- as used in some application (e.g., to ascertain experting organisations to government, industry and, ﬁnal- views on a particular topic), or that omitted keyly, to academe. This progression caused them some details of method or results because Delphi per seconcern, for they went on to suggest that the was not the focus of interest (e.g., Brockhoff, 1984).‘explosive growth’ in Delphi applications may have Several other studies were excluded because weappeared ‘‘ . . . incompatible with the limited amount believe they yielded no substantial ﬁndings, or
356 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375because they were of high complexity with uncertain authors whether there were any evaluative studiesﬁndings. For example, a paper by Van Dijk (1990) that we had overlooked. The eight replies caused useffectively used six different Delphi-like techniques to make two minor changes to our tables. None ofthat differed in terms of the order in which each of the replies noted evaluative studies of which we werethree forms of panellist interaction were used in each unaware.of three ‘Delphi’ rounds, resulting in a highly Details of the Delphi evaluative studies are sum-complex, difﬁcult to encode and, indeed, difﬁcult to marised in Tables 1 and 2. Table 1 describes theinterpret study. characteristics of each experimental scenario, includ- The ﬁnal type of studies that we excluded from ing the nature of the task and the way in whichour search ‘hits’ were those that considered the Delphi was constituted. Table 2 reports the ﬁndingsuniversal validity or reliability of Delphi without of the studies plus ‘additional comments’ (which alsoreference to control conditions or other techniques. concern details in Table 1). Ideally, the two tablesFor example, the study of Ono and Wedermeyer should be joined to form one comprehensive table,(1994) reported that a Delphi panel produced fore- but space constraints forced the division.casts that were over 50% accurate, and used this as We have tried to make the tables as comprehen-‘evidence’ that the technique is somehow a valid sive as possible without obscuring the main ﬁndingspredictor of the future. However, the validity of the and methodological features by descriptions of minortechnique, in this sense, will depend as much on the or coincidental results and details, or by repetition ofnature of the panellists and the task as on the common information. For example, a number oftechnique itself, and it is not sensible to suggest that Delphi studies used post-task questionnaires to assessthe percentage accuracy achieved here is in any way panellists’ reactions to the technique, with individualgeneralisable or fundamental (i.e., to suggest that questions ranging from how ‘satisﬁed’ subjects feltusing Delphi will generally result in 50% correct about the technique, to how ‘enjoyable’ they found itpredictions). The important questions – not asked in (e.g., Van de Ven & Delbecq, 1974; Scheibe et al.,this study – are whether the application of the 1975; Rohrbaugh, 1979; Boje & Murnighan, 1982).technique helped to improve the forecasting of the Clearly, to include every analysis or signiﬁcantindividual panellists, and how effective the technique correlation involving such measures would result inwas relative to other possible forecasting procedures. an unwieldy mass of data, much of which has littleSimilarly, papers by Felsenthal and Fuchs (1976), theoretical importance and is unlikely to be repli-Dagenais (1978), and Kastein, Jacobs, van der Hell, cated or considered in future studies. We have alsoLuttik and Touw-Otten (1993) – which reported data been constrained by lack of space: so, while we haveon the reliability of Delphi-like groups – are not attempted to indicate how aspects such as theconsidered further, for they simply demonstrate that dependent variables were measured (Table 2), wea degree of reliability is possible using the technique. have generally been forced to give only brief de- The database search produced 19 papers relevant scriptions in place of the complex scoring formulaeto our present concern. The references in these used. In other cases where descriptions seem overlypapers drew our attention to a further eight studies, brief, the reader is referred to the main text whereyielding 27 studies in all. Roughly following the matters are explained more fully.procedure of Armstrong and Lusk (1987), we sent Regarding Table 1, the studies have been classi-out information to the authors of these papers. We ﬁed according to their aims and objectives as ‘Appli-identiﬁed the current addresses of the ﬁrst-named cation’, ‘Process’ or ‘Technique-comparison’ type.authors of 25 studies by using the ISI Social Citation Application studies use Delphi in order to gaugeIndex database. We produced ﬁve prototype tables expert opinion on a particular topic, but considersummarising the methods and ﬁndings of these issues on the process or quality of the technique as aevaluative papers (discussed shortly), which we secondary concern. Process studies focus on aspectsasked the authors to consider and comment upon, of the internal features of Delphi, such as the role ofparticularly with regard to our coding and interpreta- feedback, the impact of the size of Delphi groups,tion of each author’s own paper. We also asked the and so on. Technique-comparison studies evaluate
Table 1Summary of the methodological features of Delphi in experimental studies aStudy Study Delphi Rounds Nature of Delphi Nature of Task Incentives G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 type group size feedback subjects offered?Dalkey and Helmer (1963) Application 7 5 Individual estimates Professionals: e.g., economists, Hypothetical event: number of No (process) (bombing schedules) systems analysts, electronic bombs to reduce munitions engineer output by stated amountDalkey, Brown and Cochran (1970) Process 15–20 2 Unclear Students: no expertise on task Almanac questions (20 per No group, 160 in total)Jolson and Rossow (1971) Process 11 and 14 3 Medians (normalised Professionals: computing corporation Forecast (demand for classes: No probability medians) staff, and naval personnel one question) Almanac (two questions)Gustafson, Shukla, Delbecq and Walster (1973) Technique 4 2 Individual estimates Students: no expertise on task Almanac (eight questions requiring No comparison (likelihood ratios) probability estimates in the form of likelihood ratios)Best (1974) Process 14 2 Medians, range of estimates, Professionals: faculty members Almanac (two questions requiring No distributions of expertise of college numerical estimates of known ratings, and, for one group, quantity), and Hypothetical event reasons (one probability estimation)Van de Ven and Delbecq (1974) Technique 7 2 Pooled ideas of group Students (appropriate expertise), Idea generation (one item: deﬁning No comparison members Professionals in education job description of student dormitory counsellors)Scheibe, Skutsch and Schofer (1975) Process Unspeciﬁed 5 Means, frequency Students (uncertain expertise Goals Delphi (evaluating No (possibly 21) distribution, reasons (from on task) hypothetical transport facility Ps distant from mean) alternatives via rating goals)Mulgrave and Ducanis (1975) Process 98 3 Medians, interquartile Students enrolled in Educational Almanac (10 general, and 10 No range, own responses to Psychology class (most of related to teaching) and Hypo- previous round whom were school teachers) thetical (18 on what values in U.S. will be, and then should be) 357
358Table 1. ContinuedStudy Study Delphi Rounds Nature of Delphi Nature of Task Incentives type group size feedback subjects offered?Brockhoff (1975) Technique 5, 7, 9, 11 5 Medians, reasons (of Professionals: banking Almanac and Forecasting Yes comparison, those with responses (8–10 items on ﬁnance, banking, Process outside of quartiles) stock quotations, foreign trade, for each type)Miner (1979) Technique 4 Up to 7 Not speciﬁed Students: from diverse Problem Solving (role-playing No comparison undergrad populations exercise assigning workers to subassembly positions)Rohrbaugh (1979) Technique 5–7 3 Medians, ranges Students (appropriate expertise) Forecasting (college performance No comparison re grade point average of 40 freshmen)Fischer (1981) Technique 3 2 Individual estimates Students (appropriate expertise) Giving probability of grade point Yes comparison (subjective probability averages of 10 freshmen falling distributions) in four rangesBoje and Murnighan (1982) Technique 3, 7, 11 3 Individual estimates, Students (no expertise on task) Almanac (two questions Jupiter and No comparison one reason from each P Dollars) and Subjective Likelihood (process) (two questions: Weight and Height)Spinelli (1983) Application 20 4 Modes (via histogram), Professionals: education Forecasting (likelihood of 34 No (process) (24 initially) distributions, reasons education-related events occurring in 20 years, on 1–6 scale) ´Larreche and Moinpour (1983) Technique 5 4 Means, lowest and MBA students (with some Forecasting (market share of No comparison highest estimates, expertise in business area product based on eight communications reasons of task) plans; historical data provided)Riggs (1983) Technique 4–5 2 Means (of point spreads) Students Forecasting (point spreads of two No G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 comparison college football games taking place 4 weeks in future) ´Parente et al. (1984) Process 80 2 Percentage of group who Students Forecasting (30 scenarios on No predicted event to occur, economic, political and median time scale military events)Erffmeyer and Lane (1984) Technique 4 6 Individual estimates Students (no expertise on task) Hypothetical event (Moon Survival No comparison (ranks), reasons Problem, involving ranking 15 items for survival value)
Erffmeyer, Erffmeyer and Lane (1986) Process 4 6 Individual estimates Students (no expertise on task) Hypothetical event (Moon Survival No (ranks), reasons Problem, involving ranking 15 items for survival value)Dietz (1987) Process 8 3 Mean, median, upper and Students: state government Forecasting (proportion of No lower quantities, plus analysts or employed in citizens voting yes on ballot justiﬁcations on r3 public sector ﬁrms in California election)Sniezek (1989) Technique 5 Varied Medians Students (no expertise on task) Forecasting (of dollar amount of Yes comparison #5 next month’s sales at campus store re four categories of items)Sniezek (1990) Technique 5 ,6 Medians Students (no expertise on task) Forecasting (ﬁve items concerned Yes comparison with economic variables: forecasts were for 3 months in the future)Taylor, Pease and Reid (1990) Process 59 3 Individual responses (i.e. Elementary school Policy (identifying the qualities of No chosen characteristics) teachers effective inservice program: determining top seven characteristics)Leape, Freshour, Yntema and Hsiao (1992) Technique 19, 11 2 and 3 Distribution of Surgeons Policy (judged intraservice work No comparison numerical evaluations required to perform each of 55 of each of the 55 items services, via magnitude estimation)Gowan and McNichols (1993) Process 15 in r1 3 Response frequency or Bank loan ofﬁcers and Hypothetical event (evaluation of No (application) regression weights or small business consultants likelihood of a company’s induced (if-then) rules economic viability in future)Hornsby, Smith and Gupta (1994) Technique 5 3 Statistical average of Students (no particular expertise Policy (evaluating four jobs using the No comparison points for each factor but some task training) nine-factor Factor Evaluation System)Rowe and Wright (1996) Process 5 2 Medians, individual Students (no particular expertise Forecasting (of 15 political No numerical responses, though items selected to be and economic events for reasons pertinent to them) 2 weeks into future) G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 a *Analysis signiﬁcant at the P , 0.05 level; NS, analysis not signiﬁcant at the P , 0.05 level (trend only); NDA, no statistical analysis of the noted relationship was reported; ., greater/better than; 5, no (statistically signiﬁcant)difference between; D, Delphi; R, ﬁrst round average of Delphi polls; S, staticized group approach (average of non-interacting set of individuals). When used, this indicates that separate staticized groups were used instead of (or inaddition to) the average of the individual panellists from the ﬁrst round Delphi polls (i.e., instead of/in addition to ‘R’); I, interacting group; NGT, Nominal Group Technique; r, ‘round’; P, ‘panellist’; NA, not applicable. 359
360Table 2Results of the experimental Delphi studies aStudy Independent Dependent Results of technique Other results, Additional comments variables variables comparisons e.g. re processes (and levels)b,c (and measures)Dalkey and Helmer (1963) Rounds Convergence (difference in NA Increased convergence (consensus) A complex design where quantitative feedback estimates over rounds) over rounds (NDA) from panellists was only supplied in round 4, and from only two other panellistsDalkey et al. (1970) Rounds Accuracy (logarithm of median group Accuracy: D.R (NDA) High self-rated knowledgeability was Little statistical analysis reported. estimate divided by true answer) positively related to accuracy (NDA) Little detail on experimental method, e.g. no clear Self-rated knowledge (1–5 scale) description of round 2 procedure, or feedbackJolson and Rossow (1971) Expertise (experts vs. non-experts) Accuracy (numerical estimates on the Accuracy: D.R (experts) (NDA) Increased consensus over rounds*. Often reported study. Used a small sample, Rounds two almanac questions only). Accuracy: D,R (non-experts) (NDA) Claimed relationship between high r1 accuracy so little statistical analysis was possible Consensus (distribution of P responses) and low variance over rounds (‘signiﬁcant’)Gustafson et al. (1973) Techniques (NGT, I, D, S) Accuracy (absolute mean percentage Accuracy: NGT.I.S.D (* for techniques) NA Information given to all panellists to control Rounds error). Accuracy: D,R (NS) for differences in knowledge. This approach Consensus (variance of estimates Consensus: NGT5D5I5S seems at odds to rationale of using Delphi with across groups) heterogenous groupsBest (1974) Feedback (statistic vs. reasons) Accuracy (absolute deviation of Accuracy: D.R* High self-rated expertise positively related Self-rated expertise was used as both a Self-rated expertise (high vs. low) assessments from true value). Accuracy: D(1reasons).D(no reasons)* to accuracy, and sub-groups of self-rated dependent and independent variable Rounds Self-rated expertise (1–11 scale) (for one of two items) experts more accurate than non-experts*Van de Ven and Delbecq (1974) Techniques (NGT, I, D) Idea quantity (number of unique ideas). Number of ideas: D.I*; NGT.D (NS) NA Two open-ended questions were also asked Satisfaction (ﬁve items rated on 1–5 scales). Satisfaction: NGT.D*, I* about likes and dislikes concerning the Effectiveness (composite measure of idea Effectiveness: NGT*, D*.I procedures quantity and satisfaction, equally weighted)Scheibe et al. (1975) Rating scales (ranking, rating- Conﬁdence (assessed by seven item NA Little evidence of correlation between response Poor design meant nothing could be concluded scales method, and pair comparisons) questionnaire) change and low conﬁdence, except in r2(*). about consistency of using different rating scales. G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 Rounds Change (ﬁve aspects, e.g. from r1 to r2) False feedback initially drew responses to it (NDA) Missing information on design detailsMulgrave and Ducanis (1975) Expertise (e.g., on items about Change (% changing answers) NA High dogmatism related to high change over A brief study, with little discussion of the teaching Ps considered expert, Dogmatism (on Rokeach Dogmatism rounds*, most evident on items on which the unexpected/counter-intuitive results on general items they non-expert) Scale) most dogmatic were relatively inexpert Rounds
Brockhoff (1975) Techniques (D, I) Accuracy (absolute error) Accuracy: D.I (almanac items) (NDA) No clear results on inﬂuence of group size. Signiﬁcant differences in accuracy of groups Group size (5, 7, 9, 11, D; 4, 6, 8, 10, I) Self-rated expertise (1–5 scale) Accuracy: I.D (forecast items) (NDA) Self-rated expertise not related to accuracy of different sizes were due to differences in Item type (forecasting vs. almanac) Accuracy: D.R (NS) initial accuracy of groups at round 1 Rounds (Accuracy increased to r3; then decreased)Miner (1979) Techniques (NGT, D, PCL) Accuracy/Quality (index from 0.0 to 1.0) Accuracy: PCL.NGT.D (NS) NA Effectiveness was determined by the product Acceptance (11 statement q’naire) Acceptance: D5NGT5PCL of quality (accuracy) and acceptance Effectiveness (product of other indexes) Effectiveness: PCDL.D*, NGT*Rohrbaugh (1979) Techniques (D, SJ) Accuracy (correlation with solution) Accuracy: D5SJ Quality of decisions of individuals increased only Delphi groups instructed to reduced existing Post-group agreement (difference between Post-group agreement: SJ.D (‘signiﬁcant’) slightly after the group Delphi process differences over rounds. Agreement concerns group and individual post-group policy) Satisfaction: SJ.D (‘signiﬁcant’)* degree to which panellists agreed with each other Satisfaction (0–10 scale) after group process had ended (post-task rating)Fischer (1981) Techniques (NGT, D, I, S) Accuracy (according to truncated Accuracy: D5NGT5I5S NA logarithmic scoring rule) (Procedures Main Effect: P . 0.20)Boje and Murnighan (1982) Techniques (NGT, D, Iteration) Accuracy (deviation of gp mean from Accuracy: R.D, NGT No relationship between group size and Compared Delphi to an NGT procedure and to Group size (3, 7, 11) correct answer) Accuracy: Iteration procedure.R accuracy or conﬁdence. an iteration (no feedback) procedure Rounds Conﬁdence (1–7 scale) (Procedures3Trials was signiﬁcant)* Conﬁdence in Delphi increased over rounds* Satisfaction (responses on 10 item q’naire)Spinelli (1983) Rounds Convergence (e.g., change in NA No convergence over rounds No quoted statistical analysis. interquartile range) In r3 and r4, Ps encouraged to converge towards majority opinion! ´Larreche and Moinpour (1983) Techniques (D, I) Accuracy (Mean Absolute Percentage Error) Accuracy: D.R* Expertise assessed by self-rating had no signiﬁcant A measure of conﬁdence was taken as synonymous Rounds Expertise (via (a) Conﬁdence in estimates, Accuracy: D.I* validity, but did when assessed by an with expertise. Evidence that composing (b) external measure) Accuracy: R5I external measure.* Aggregate of latter ‘experts’ Delphi groups of the best (externally-rated) increased accuracy over other Delphi groups* experts may improve judgementRiggs (1983) Techniques (D, I) Accuracy (absolute difference between Accuracy: D.I* More accurate predictions were provided for the Comparison with ﬁrst round not reported. Information on task (high vs. low) predicted and actual point spreads) high information game.* Delphi was more Standard information on the four teams in the accurate in both high and low information cases two games was provided ´Parente, Anderson, Feedback (yes vs. no) Accuracy (IF an event would occur; Accuracy: D.R (WHEN event occurs) Decomposition of Delphi seemed to show Three related studies reported: data here on ex3. G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375Myers and O’Brien (1984) Iteration (yes vs. no) and error in days re WHEN an event Accuracy: D5R (IF event occurs) improvement in accuracy related to the iteration Suggestion that feedback superﬂuous in Delphi, would occur) component, not feedback (WHEN item) however, feedback used here was superﬁcialErffmeyer and Lane (1984) Techniques (NGT, D, I, Group Quality (comparison of rankings to ‘correct Quality: D.NGT*, I*, Guided gp* NA Acceptance bears similarity to the post-task procedure with guidelines ranks’ assigned by NASA experts) Quality: D.R*, also note: I.NGT* consensus (‘agreement’) measure of on appropriate behavior) Acceptance (degree of agreement between Acceptance: Guided gp procedure.D* Rohrbaugh (1979) Rounds an individual’s ranking and ﬁnal grp ranking) 361
362Table 2. ContinuedStudy Independent Dependent Results of technique Other results, Additional comments variables variables comparisons e.g. re processes (and levels)b,c (and measures)Erffmeyer et al. (1986) Rounds Quality (comparison of rankings to ‘correct Quality: D.R* Quality increased up to the fourth round, Is this an analysis of data from the study of ranks’ assigned by NASA experts) but not thereafter: ‘stability’ was achieved Erffmeyer and Lane (1984)?Dietz (1987) Weighting (using conﬁdence vs. no) Accuracy (difference between forecast Accuracy D.R (NS) Weighted forecasts less accurate than non- Very small study with only one group of eight Summary measure (median, mean, value and actual percentage) weighted ones, though difference small. panellists, and hence it is not surprising that Hampel, biweight, trimmed mean) Conﬁdence (1–5 scale) Differences in summary methods insigniﬁcant. statistically signiﬁcant results were not achieved Rounds Least conﬁdent not most likely to change opinionsSniezek (1989) Techniques (D, I, Dictator, Dialectic) Accuracy (absolute percentage error) Accuracy: Dictator, D.I (NDA) High conﬁdence and accuracy were correlated in Inﬂuence measure concerned extent to which an (repeated-measures design) Conﬁdence (1–7 scale) Accuracy: D.R (NS) Delphi*. Inﬂuence in Delphi was correlated individual’s estimates drew the group judgment. Rounds Inﬂuence (difference between individual’s with conﬁdence* and accuracy* Time-series data supplied. judgment and r1 group mean)Sniezek (1990) Techniques (D, I, S, Best Accuracy (e.g., mean forecast error) Accuracy: D.BM (one of ﬁve items) (P , 0.10) Conﬁdence measures were not correlated to All individuals were given common information Member (BM)) Conﬁdence (difference between upper and Accuracy: S.D (one of ﬁve items) (P , 0.10) accuracy in Delphi groups (time-series data) prior to the task. Item difﬁculty (mean % errors) lower limits of Ps 50% conﬁdence intervals) No. sig. differences in other comparisons Item difﬁculty measured post taskTaylor et al. (1990) None Survivability (degree to which r1 contribution NA No relationship between demographic Abandonment and Survivability relate to of each P is retained in later rounds) characteristics of panellists (gender, education, inﬂuence and opinion change Abandonment (degree of Ps abandoning r1 interest, and effectiveness and contributions in favour of those of others) survivability/abandonment) Demographic factors (four types)Leape et al. (1992) Techniques (D, single round mail Agreement (compared mean of adjusted Agreement: Mail survey.D.Modiﬁed D NA Agreement was measured by comparing panel survey, moderated D with panel ratings to that gained in national survey) (Modiﬁed D included panel discussion) estimates to those from a national survey. Not a discussion) true measure of accuracyGowan and McNichols (1993) Feedback (statistical, regression Consensus (pairwise subject agreement NA If-then rule feedback gave greater consensus Accuracy was measured, but not reported in model, if-then rule feedback) after each round of decisions) than statistical or regression model feedback* this paperHornsby et al. (1994) Techniques (NGT, D, Consensus) Degree of opinion change (from averaged Opinions changed from initial evaluations NA No measure of accuracy was possible. G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 individual evaluations to ﬁnal gp evaluation) during NGT* and consensus*, but not Delphi. Consensus involves discussion until all group Satisfaction (12 item q’naire: 1–5 scale) Satisfaction: D.consensus*, NGT* members accept ﬁnal decisionRowe and Wright (1996) Feedback (statistical, reasons, Accuracy (Mean Absolute Percentage Error) Accuracy: D(reas), D(start), Iterate.R (all*) Best objective experts changed least in the two Although subjects changed their forecasts less in iteration – without feedback) Opinion change (in terms of z-score measure) Accuracy: D(reas).D(start).Iteration (NS) feedback conditions*. the reasons condition, when they did so the (repeated measures design) Conﬁdence (each item, 1–7 scale) Change: Iteration.D(start), D(reas)* High self-rated experts were most accurate on r1* new forecasts were more accurate.* This trend Rounds Self-rated expertise (one measure, 1–7 scale) but no relationship to change variable. did not occur in the other conditions, suggesting Objective expertise of Ps (via MAPES) Conﬁdence increased over rounds in all conditions* reasons feedback helped discriminality of change a See Table 1 for explanation of abbreviations. b For the number of rounds (i.e. levels of independent variable ‘Round’), see Table 1. c In many of the studies, a number of different questions are presented to the subjects so that ‘items’ may be considered to be an independent variable. Usually, however, there is no speciﬁc rationale behind analysing responses to the different items, so we do not consider ‘items’ an I.V. here.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 363Delphi in comparison to external benchmarks, that is, trend of reduced variance is so typical that thewith respect to other techniques or means of ag- phenomenon of increased ‘consensus’, per se, nogregating judgment (for example, interacting longer appears to be an issue of experimentalgroups). interest. In Table 2, the main ﬁndings have been coded Where some controversy does exist, however, is inaccording to how they were presented by their whether a reduction in variance over rounds reﬂectsauthors. In some cases, claims have been made that true consensus (reasoned acceptance of a position).an effect or relationship was found even though there Delphi has, after all, been advocated as a method ofwas no backing statistical analysis, or though a reducing group pressures to conform (e.g., Martino,statistical test was conducted but standard signiﬁ- 1983), and both increased consensus and increasedcance levels were not reached (because of, for conformity will be manifest as a convergence ofexample, small sample sizes, e.g. Jolson & Rossow, panellists’ estimates over rounds (i.e., these factors1971). It has been argued that there are problems in are confounded). It would seem in the literature thatusing and interpreting ‘P’ ﬁgures in experiments and reduced variance has been interpreted according tothat effect sizes may be more useful statistics, the position on Delphi held by the particular author /particularly when comparing the results of indepen- s, with proponents of Delphi arguing that resultsdent studies (e.g., Rosenthal, 1978; Cohen, 1994). In demonstrate consensus, while critics have arguedmany Delphi studies, however, effect sizes are not that the ‘consensus’ is often only ‘apparent’, and thatreported in detail. Thus, Table 2 has been annotated the convergence of responses is mainly attributableto indicate whether claimed results were signiﬁcant to other social-psychological factors leading to con-at the P 5 0.05 level (*); whether statistical tests formity (e.g., Sackman, 1975; Bardecki, 1984;were conducted but proved non-signiﬁcant (NS); or Stewart, 1987). Clearly, if panellists are being drawnwhether there was no direct statistical analysis of the towards a central value for reasons other than aresult (NDA). Readers are invited to read whatever genuine acceptance of the rationale behind thatinterpretation they wish into claims of the statistical position, then inefﬁcient process-loss factors are stillsigniﬁcance of ﬁndings. present in the technique. Alternative measures of consensus have been taken, such as ‘post-group consensus’. This concerns5. Findings the extent to which individuals – after the Delphi process has been completed – individually agree In this section, we consider the results obtained by with the ﬁnal group aggregate, their own ﬁnal roundthe evaluative studies as summarised in Table 2, to estimates, or the estimates of other panellists. Rohr-which the reader is referred. We will return to the baugh (1979), for example, compared individuals’details in Table 1 in our subsequent critique. post-group responses to their aggregate group re- sponses, and seemed to show that reduction in5.1. Consensus ‘disagreement’ in Delphi groups was signiﬁcantly less than the reduction achieved with an alternative One of the aims of using Delphi is to achieve technique (Social Judgment Analysis). Furthermore,greater consensus amongst panellists. Empirically, he found that there was little increase in agreementconsensus has been determined by measuring the in the Delphi groups. This latter ﬁnding seems tovariance in responses of Delphi panellists over suggest that panellists were simply altering theirrounds, with a reduction in variance being taken to estimates in order to conform to the group withoutindicate that greater consensus has been achieved. actually changing their opinions (i.e., implying con-Results from empirical studies seem to suggest that formity rather than genuine consensus).variance reduction is typical, although claims tend to Erffmeyer and Lane (1984) correlated post-groupbe simply reported unanalysed (e.g., Dalkey & individual responses to group scores and found thereHelmer, 1963), rather than supported by analysis to be signiﬁcantly more ‘acceptance’ (i.e., signiﬁ-(though see Jolson & Rossow, 1971). Indeed, the cantly higher correlations between these measures) in
364 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375an alternative structured group technique than in a ing a ﬁnal round Delphi aggregate to that of the ﬁrstDelphi procedure, although there were no differences round is thus, effectively, a within-subjects com-in acceptance between the Delphi groups and a parison of techniques (Delphi versus staticizedvariety of other group techniques (see Table 2). group); when comparison occurs between Delphi andUnfortunately, no analysis was reported on the a separate staticized group then it is a between-difference between the correlations of the post-group subjects one. Clearly, the former comparison isindividual responses and the ﬁrst and ﬁnal round preferable, given that it controls for the highlygroup aggregate responses. variable inﬂuence of subjects. Although the com- An alternative slant on this issue has been pro- parison of round averages should be possible invided by Bardecki (1984), who reported that – in a every study considering Delphi accuracy / quality, astudy not fully described – respondents with more number of evaluative studies have omitted to reportextreme views were more likely to drop out of a round differences (e.g., Fischer, 1981; Riggs, 1983).Delphi procedure than those with more moderate Comparisons of relative accuracy of Delphi panelsviews (i.e., nearer to the group average). This with ﬁrst round aggregates and staticized groups aresuggests that consensus may be due – at least in part reported in Table 3.– to attrition. Further empirical work is needed to Evidence for Delphi effectiveness is equivocal, butdetermine the extent to which the convergence of results generally support its advantage over ﬁrstthose who do not (or cannot) drop out of a Delphi round / staticized group aggregates by a tally of 12procedure are due to either true consensus or to studies to two. Five studies have reported signiﬁcantconformity pressures. increases in accuracy over Delphi rounds (Best, ´ 1974; Larreche & Moinpour, 1983; Erffmeyer & Lane, 1984; Erffmeyer et al., 1986; Rowe & Wright,5.2. Increased accuracy 1996), although the two papers of Erffmeyer et al. may be reports of separate analyses on the same data Of main concern to the majority of researchers is (this is not clear). Seven more studies have producedthe ability of Delphi to lead to judgments that are qualiﬁed support for Delphi: in ﬁve cases, Delphi ismore accurate than (a) initial, pre-procedure aggre- found to be better than statistical or ﬁrst roundgates (equivalent to equal-weighted staticized aggregates more often than not, or to a degree thatgroups), and (b) judgments derived from alternative does not reach statistical signiﬁcance (e.g., Dalkey etgroup procedures. In order to ensure that accuracy al., 1970; Brockhoff, 1975; Rohrbaugh, 1979; Dietz,can be rapidly assessed, problems used in Delphi 1987; Sniezek, 1989), and in two others it is shownstudies have tended to involve either short-range to be better under certain conditions and not othersforecasting tasks, or tasks requiring the estimation of ´ (i.e., in Parente et al., 1984, Delphi accuracy in-almanac items whose quantitative values are already creases over rounds for predicting ‘when’ an eventknown to the experimenters and about which sub- might occur, but not ‘if’ it will occur; in Jolson &jects are presumed capable of making educated Rossow, 1971, it increases for panels comprisingguesses. In evaluative studies, the long-range fore- ‘experts’, but not for ‘non-experts’).casting and policy formation items (etc.) that are In contrast, two studies found no substantialtypical in Delphi applications are rarely used. difference in accuracy between Delphi and staticized Table 2 reports the results of the evaluation of groups (Fischer, 1981; Sniezek, 1990), while twoDelphi with regards to the various benchmarks, others suggested Delphi accuracy was worse. Gustaf-while Tables 3–5 collate and summarise the com- son et al. (1973) found that Delphi groups were lessparisons according to speciﬁc benchmarks. accurate than both their ﬁrst round aggregates (for seven out of eight items) and than independent5.2.1. Delphi versus staticized groups staticized groups (for six out of eight items), while The average estimate of Delphi panellists on the Boje and Murnighan (1982) found that Delphi panelsﬁrst round – prior to iteration or feedback – is became less accurate over rounds for three out ofequivalent to that from a staticized group. Compar- four items.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 365Table 3Comparisons of Delphi to staticized (round 1) groups aStudy Criteria Trend of relationship Additional comments (moderating variable)Dalkey et al. (1970) Accuracy D.R (NDA) 81 groups became more accurate, 61 became lessJolson and Rossow (1971) Accuracy D.R (experts) (NDA) For both items in each case Accuracy D,R (‘non experts’) (NDA)Gustafson et al. (1973) Accuracy D,S (NDA) D worse than S on six of eight D,R (NDA) items, and than R on seven of eight itemsBest (1974) Accuracy D.R* For each of two itemsBrockhoff (1975) Accuracy D.R (almanac items) Accuracy generally increased D.R (forecast items) to r3, then started to decreaseRohrbaugh (1979) Accuracy D.R (NDA) No direct comparison made, but trend implied in resultsFischer (1981) Accuracy D5S Mean group scores were virtually indistinguishableBoje and Murnighan (1982) Accuracy R.D* Accuracy in D decreased over rounds for three out of four items ´Larreche and Moinpour (1983) Accuracy D.R* ´Parente et al. (1984) Accuracy D.R (WHEN items) (NDA) The ﬁndings noted here are the Accuracy D5R (IF items) (NDA) trends as noted in the paperErffmeyer and Lane (1984) Quality D.R* Of four procedures, D seemed to improve most over roundsErffmeyer et al. (1986) Quality D.R* Most improvement in early roundsDietz (1987) Accuracy D.R Error of D panels decreased from r1 to r2 to r3Sniezek (1989) Accuracy D.R Small improvement over rounds but not signiﬁcantSniezek (1990) Accuracy S.D (one of ﬁve items) No reported comparison of accuracy change over D roundsRowe and Wright (1996) Accuracy D (reasons f’back).R* Accuracy D (stats f’back).R* a See Table 1 for explanation of abbreviations.5.2.2. Delphi versus interacting groups interacting groups have been compared are presented Another important comparative benchmark for in Table 4.Delphi performance is provided by interacting Research supports the relative efﬁcacy of Delphigroups. Indeed, aspects of the design of Delphi are over interacting groups by a score of ﬁve studies tointended to pre-empt the kinds of social / psychologi- one with two ties, and with one study showingcal / political difﬁculties that have been found to task-speciﬁc support for both techniques. Support forhinder effective communication and behaviour in Delphi comes from Van de Ven and Delbecq (1974),groups. The results of studies in which Delphi and ´ Riggs (1983), Larreche and Moinpour (1983),
366 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375Table 4Comparisons of Delphi to interacting groups aStudy Criteria Trend of relationship Additional comments (moderating variable)Gustafson et al. (1973) Accuracy I.D (NDA) D worse than I on ﬁve of eight itemsVan de Ven and Delbecq (1974) Number of ideas D.I*Brockhoff (1975) Accuracy D.I (almanac items) (NDA) Comparisons were between I I.D (forecast items) (NDA) and r3 of DFischer (1981) Accuracy D5I Mean group scores were indistinguishable ´Larreche and Moinpour (1983) Accuracy D.I*Riggs (1983) Accuracy D.I* Result was signiﬁcant for both high and low information tasksErffmeyer and Lane (1984) Quality D.I*Sniezek (1989) Accuracy D.I (NDA) Small sample sizesSniezek (1990) Accuracy D5I No difference at P 5 0.10 level on comparisons on the ﬁve items a See Table 1 for explanation of abbreviations.Erffmeyer and Lane (1984), and Sniezek (1989). needs to be done quickly, while Delphi would be aptFischer (1981) and Sniezek (1990) found no dis- when experts cannot meet physically; but whichtinguishable differences in accuracy between the two technique is preferable when both are options?approaches, while Gustafson et al. (1973) found a Results of Delphi–NGT comparisons do not fullysmall advantage for interacting groups. Brockhoff answer this question. Although there is some evi-(1975) seemed to show that the nature of the task is dence that NGT groups make more accurate judg-important, with Delphi being more accurate than ments than Delphi groups (Gustafson et al., 1973;interacting groups for almanac items, but the reverse Van de Ven & Delbecq, 1974), other studies havebeing the case for forecasting items. found no notable differences in accuracy / quality between them (Miner, 1979; Fischer, 1981; Boje &5.2.3. Delphi versus other procedures Murnighan, 1982), while one study has shown Although evidence suggests that Delphi does Delphi superiority (Erffmeyer & Lane, 1984).generally lead to improved judgments over staticized Other studies have compared Delphi to: groups ingroups and unstructured interacting groups, it is which members were required to argue both for andclearly of interest to see how Delphi performs in against their individual judgments (the ‘Dialectic’comparison to groups using other structured pro- procedure – Sniezek, 1989); groups whose judg-cedures. Table 5 presents the ﬁndings of studies ments were derived from a single, group-selectedwhich have attempted such comparisons. individual (the ‘Dictator’ or ‘Best Member’ strategy A number of studies have compared Delphi to the – Sniezek, 1989, 1990); groups that received rulesNominal Group Technique or NGT (also known as on interaction (Erffmeyer & Lane, 1984); groupsthe ‘estimate-talk-estimate’ procedure). NGT uses whose information exchange was structured accord-the basic Delphi structure, but uses it in face-to-face ing to Social Judgment Analysis (Rohrbaugh, 1979);meetings that allow discussion between rounds. and groups following a Problem Centred LeadershipClearly, there are practical situations where one (PCL) approach (Miner, 1979). The only studies thattechnique may be viable and not the other, for found any substantial differences between Delphiexample NGT would seem appropriate when a job and the comparison procedure / s are those of
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 367Table 5Comparisons of Delphi to other structured group procedures aStudy Criteria Trend of relationship Additional comments (moderating variable)Gustafson et al. (1973) Accuracy NGT.D* NGT-like technique was more accurate than D on each of eight itemsVan de Ven and Delbecq (1974) [ of ideas NGT.D NGT gps generated, on average, 12% more unique ideas than D gpsMiner (1979) Quality PCL.NGT.D PCL was more effective than D (P , 0.01) (see Table 2)Rohrbaugh (1979) Accuracy D5SJ No P statistic reported due to nature of study / analysisFischer (1981) Accuracy D5NGT Mean group scores were similar, but NGT very slightly better than DBoje and Murnighan (1982) Accuracy D5NGT On the four items, D was slightly more accurate than NGT gps on ﬁnal rErffmeyer and Lane (1984) Quality D.NGT* Guided gps followed guidelines D.Guided Group* on resolving conﬂictSniezek (1989) Accuracy Dictator5D5Dialectic Dialectic gps argued for and against (NDA) positions. Dictator gps chose one member to make gp decisionSniezek (1990) Accuracy D.BM (one BM5Dictator (above): gp chooses of ﬁve items) one member to make decision a See Table 1 for explanation of abbreviations.Erffmeyer and Lane (1984), which found Delphi to poor backgrounds in the social sciences and lackedbe better than groups that were given instructions on acquaintance with appropriate research methodolo-resolving conﬂict, and Miner (1979), which found gies (e.g., Sackman, 1975).that the PCL approach (which involves instructing Over the past few decades, empirical evaluationsgroup leaders in appropriate group-directing skills) to of the Delphi technique have taken on a morebe signiﬁcantly more ‘effective’ than Delphi. systematic and scientiﬁc nature, typiﬁed by the controlled comparison of Delphi-like procedures with other group and individual methods for obtain-6. A critique of technique-comparison studies ing judgments (i.e., technique comparison). Although the methodology of these studies has provoked a Much of the criticism of early Delphi studies degree of support (e.g., Stewart, 1987), we havecentred on their ‘sloppy execution’ (e.g., Stewart, queried the relevance of the majority of this research1987). Among speciﬁc criticisms were claims that to the general question of Delphi efﬁcacy (e.g., RoweDelphi questionnaires were poorly worded and am- et al., 1991), pointing out that most of such studiesbiguous (e.g., Hill & Fowles, 1975) and that the have used versions of Delphi somewhat removedanalysis of responses was often superﬁcial (Linstone, from the ‘classical’ archetype of the technique1975). Reasons given for the poor conduct of early (consider Table 1).studies ranged from the technique’s ‘apparent sim- To begin with, the majority of studies have usedplicity’ encouraging people without the requisite structured ﬁrst rounds in which event statements –skills to use it (Linstone & Turoff, 1975), to devised by experimenters – are simply presented tosuggestions that the early Delphi researchers had panellists for assessment, with no opportunity for
368 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375them to indicate the issues they believe to be of possess similar knowledge (indeed, certain studiesgreatest importance on the topic (via an unstructured have even tried to standardise knowledge prior to theround), and thus militating against the construction Delphi manipulation; e.g., Riggs, 1983). Indeed, theof coherent task scenarios. The concern of research- few laboratory studies that have used non-studentsers to produce simple experimental designs is com- have tended to use professionals from a singlemendable, but sometimes the consequence is over- domain (e.g., Brockhoff, 1975; Leape et al., 1992;simpliﬁcation and designs / tasks / measures of uncer- Ono & Wedermeyer, 1994). This point is important:tain relevance. Indeed, because of a desire to verify if varied information is not available to be shared,rapidly the accuracy of judgments, a number of then what could possibly be the beneﬁt of anyDelphi studies have used almanac questions involv- group-like aggregation over and above a statisticaling, for example, estimating the diameter of the aggregation procedure?planet Jupiter (e.g., as in Gustafson et al., 1973; There is some evidence that panels composed ofMulgrave & Ducanis, 1975; Boje & Murnighan, relative experts tend to beneﬁt from a Delphi pro-1982). Unfortunately, it is difﬁcult to imagine how cedure to a greater extent than comparative aggre-Delphi might conceivably beneﬁt a group of un- gates of novices (e.g., Jolson & Rossow, 1971;knowledgeable subjects – who are largely guessing ´ Larreche & Moinpour, 1983), suggesting that typicalthe order of magnitude of some quantitatively large laboratory studies may underestimate the value ofand unfamiliar statistic, such as the tonnage of a Delphi. Indeed, the relevance of the expertise ofcertain material shipped from New York in a certain members of a staticized group, and the extent toyear – in achieving a reasoned consideration of which their judgments are shared / correlated, haveothers’ knowledge, and thus of improving the ac- been shown to be important factors related tocuracy of their estimates. However, even when the accuracy in the theoretical model of Hogarth (1978),items are sensible (e.g., of short-term forecasting which has been empirically supported by a numbertype), their pertinence must also depend on the of studies (e.g., Einhorn, Hogarth & Klempner, 1977;relevance of the expertise of the panellists: ‘sensible’ Ashton, 1986).questions are only sensible if they relate to the The absolute and relative expertise of a set ofdomain of knowledge of the speciﬁc panellists. That individuals is also liable to interact with the effec-is, student subjects are unlikely to respond to, for tiveness of different group techniques – not onlyexample, a task involving the forecast of economic statistical groups and Delphi, but also interactingvariables, in the same way as economics experts – groups and techniques such as the NGT. Because thenot only because of lesser knowledge on the topic, nature of such relationships are unknown, it is likelybut because of differences in aspects such as their that, for example, though a Delphi-like proceduremotivation (see Bolger & Wright, 1994, for discus- might prove more effective than an NGT approach atsion of such issues). enhancing judgment with a particular set of in- Delphi, however, was ostensibly designed for use dividuals, a panel possessing different ‘expertise’with experts, particularly in cases where the variety attributes might show the reverse trend. The un-of relevant factors (economic, technical, etc.) ensures comfortable conclusion from this line of argument isthat individual panellists have only limited knowl- that it is difﬁcult to draw grand conclusions on theedge and might beneﬁt from communicating with relative efﬁcacy of Delphi from technique-compari-others possessing different information (e.g., Helmer, son studies that have made no clear attempt at1975). This use of diverse experts is, however, rarely specifying or describing the characteristics of theirfound in laboratory situations: for the sake of panels and the relevant features of the task. Further-simpliﬁcation, most studies employ homogenous more, any particular demonstration of Delphi efﬁca-samples of undergraduate or graduate students (see cy cannot be taken as an indicator of the moreTable 1), who are required to make judgments about general validity of the technique, since the demon-scenarios in which they can by no means be consid- strated effect will be partly dependent on the speciﬁcered expert, and about which they are liable to characteristics of the panel and the task.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 369 A second major difference between laboratory tion and reductionism in order to study highlyDelphis and the Delphi ideal lies in the nature of the complex phenomena seems to be at the root of thefeedback presented to panellists (see Table 1). The problem, such that experimental Delphis have tendedfeedback recommended in the ‘classical’ Delphi to use artiﬁcial tasks (that may be easily validated),comprises medians or distributions plus arguments student subjects, and simple feedback in the place offrom panellists whose estimates fall outside the meaningful and coherent tasks, experts / profession-upper and lower quartiles (e.g., Martino, 1983). In als, and complex feedback. But simpliﬁcation can gothe majority of experimental studies, however, feed- too far, and in the case of much of the Delphiback usually comprises only medians or means (e.g., research it would appear to have done so.Jolson & Rossow, 1971; Riggs, 1983; Sniezek, 1989, In order that research is conducted more fruitfully1990; Hornsby et al., 1994), or a listing of the in the future, two major recommendations wouldindividual panellists’ estimates or response distribu- appear warranted from the above discussion. Thetions (e.g., Dalkey & Helmer, 1963; Gustafson et al., ﬁrst is that a more precise deﬁnition of Delphi is1973; Fischer, 1981; Taylor, Pease & Reid, 1990; needed in order to prevent the technique from beingLeape et al., 1992; Ono & Wedermeyer, 1994), or misrepresented in the laboratory. Issues that par-some combination of these (e.g., Rohrbaugh, 1979). ticularly need clariﬁcation include: the type ofAlthough feedback of reasons or rationales behind feedback that constitutes Delphi feedback; theindividuals’ estimates has sometimes taken place criteria that should be met for convergence of ´(e.g., Boje & Murnighan, 1982; Larreche & Moin- opinion to signal the cessation of polling; the selec-pour, 1983; Erffmeyer & Lane, 1984; Rowe & tion procedures that should be used to determineWright, 1996) this form of feedback is rare. This number and type of panellists, and so on. Althoughvariability of feedback, however, must be considered suggestions on these issues do exist in the literature,a crucial issue, since it is through the medium of the fact that they have largely been ignored byfeedback that information is conveyed from one researchers suggests that they have not been ex-panellist to the next, and by limiting feedback one pressed as explicitly or as strongly as they mightmust also limit the scope for improving panellist have been, and that the importance of deﬁnitionalaggregate accuracy. The issue of feedback will be precision has been underestimated by all.considered more fully in the next section: sufﬁce it to The second recommendation is that a much greatersay here that, once more, it is apparent that sim- understanding is required of the factors that mightpliﬁed laboratory versions of Delphi tend to differ inﬂuence Delphi effectiveness, in order to preventsigniﬁcantly from the ideal in such a way that they the technique’s potential utility from being under-may limit the effectiveness of the technique. estimated through accurate representations used in From the discussion so far it has been suggested unsympathetic experimental scenarios. It would seemthat the majority of the recent studies that have likely that such research would also help in clarify-attempted to address Delphi effectiveness have dis- ing precisely what those aspects of Delphi adminis-covered little of general consequence about Delphi. tration are that need more speciﬁc deﬁnition. ForThat research has yielded so little of substance would example, if ‘number of panellists’ was empiricallysuggest that there are deﬁnitional problems with demonstrated to have no inﬂuence on the effective-Delphi that militate against standard administrations ness of variations of Delphi, then it would be clearof the procedure (e.g., Hill & Fowles, 1975). Fur- that no particular comment or qualiﬁer would bethermore, the lack of concern of experimenters in required on this topic in the technique deﬁnition.following even the fairly loose deﬁnitional guidelinesthat do exist suggests that there is a general lack ofappreciation of the importance of factors like the 7. Findings of process studiesnature of feedback in contingently inﬂuencing thesuccess of procedures like Delphi. The requirement Technique-comparison studies ask the question:of empirical social science research to use simpliﬁca- ‘does Delphi work?’, and yet they use technique
370 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375forms that differ from one study to the next and that 7.1. The role of feedbackoften differ from the ideal of Delphi. Process studiesask: ‘what is it about Delphi that makes it work?’, Feedback is the means by which information isand this involves both asking and answering the passed between panellists so that individual judg-question: ‘what is Delphi?’ ment may be improved and debiasing may occur. We believe that the emphasis of research should But there are many questions about feedback thatbe shifted from technique-comparison studies to need to be answered. For example, which individualsprocess studies. The latter should focus on the way change their judgments in response to Delphi feed-in which an initial statistical group is transformed by back – the least conﬁdent, the most dogmatic, thethe Delphi process into a ﬁnal round statistical least expert? What is it about feedback that encour-group. To be more precise, the transformation of ages change? Do different types of feedback causestatistical group characteristics must come about different types of people to change? And so on.through changes in the nature of the panellists, which Although studies such as those of Dalkey andare reﬂected by changes over rounds in their judg- Helmer (1963) and Dalkey et al. (1970) wouldments (and participation), which are manifest as appear to have examined the role of feedback, this ischanges in the individual accuracy of the ‘group’ not the case, since feedback is confounded by themembers, their judgmental intercorrelations, and unknown role of ‘iteration’, and since only singleindeed – when panellists drop out over rounds – feedback formats were used in each study. Whateven their group size (as per Hogarth, 1978). studies like those of Scheibe et al. (1975) do reveal, One conceptualisation of the problem is to view however, is the powerful inﬂuence of feedback: injudgment change over rounds as coming about their case, by demonstrating the pull that even falsethrough the interaction of internal factors related to feedback can have on panellists. Evidence from mostthe individual (such as personality styles and traits), Delphi studies shows that convergence towards theand external factors related to technique design and ‘group’ average is typical (see section on Consen-the general environment (such as the inﬂuence of sus).feedback type and the nature of the judgment task). However, more systematic efforts at assessing theFrom this perspective, the role of research should be role of feedback have been attempted by otherto identify which factors are most important in ´ authors. Parente et al. (1984), in a series of experi-explaining how and why an individual changes his / ments, used student subjects to forecast ‘if’ andher judgment, and which are related to change in the ‘when’ a number of newsworthy events would occur.desired direction (i.e., towards more accurate judg- In their third study they attempted to separate out thement). Only by understanding such relationships will iteration and feedback components by decomposingit be possible to explain the contingent differences in Delphi. They appeared to demonstrate that, althoughjudgment-enhancing utility of the various available neither iterated polling nor consensus feedback hadprocedures which include Delphi. Importantly from any apparent effect upon group ‘if’ accuracy, itera-this perspective – and unlike in technique-compari- tion alone resulted in increased accuracy for ‘when’son studies – analysis should be conducted at the a predicted event would occur – while feedbacklevel of the individual panellist, rather than at the alone did not. Similarly, Boje and Murnighan (1982)level of the nominal group, since the non-interacting found decreased accuracy over rounds for a ‘stan-nature of Delphi ensures that individual responses dard’ Delphi procedure, while a control procedureand changes are central and measurable and (unlike that simply entailed the iteration of the questionnaireinteracting groups) the ultimate group response is (with no feedback) resulted in increased accuracyeffectively ‘related to the sum of the parts’ and is not (perhaps because of the opportunity for individualsholistically ‘greater than’ these (e.g., Rowe et al., to further reﬂect on the problem). However, as noted1991). The results of those few studies that we earlier, Delphi studies have tended to involve onlyconsider to be of the process type are collated in ´ superﬁcial types of feedback. In Parente et al.subsequent sections. (1984), the feedback comprised modes and medians,
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 371while in the study of Boje and Murnighan (1982), as prescribed in the ‘classical’ deﬁnition, might leadindividuals’ justiﬁcations for their estimates were to improved Delphi performance, this is not to sayused as feedback along with the estimates themselves that ‘proper’ Delphi feedback is liable to be the best(no median or other average statistic was used). type – merely that it is liable to lead to trends in If different types of feedback differentially in- judgment change that will differ from those in manyﬂuence individuals to change their judgments over evaluative studies of ‘Delphi’. Indeed, given therounds, then comparisons across the various Delphi limited nature of recommended feedback in theversions become difﬁcult. Unfortunately, there are a classic procedure, the question remains as to hownumber of reasons – empirical, intuitive and theoret- effective even that type will be, given that theical – to suggest that such differential effects do majority of individual panellists are allowed so littleexist. In Rowe and Wright (1996) we compared input (e.g., those within the upper and lower quartiles‘iteration’, ‘statistical’ feedback (means and me- are not required to justify their estimates, evendians) and ‘reasons’ feedback (with no averages) and though they might be made for different – and evenfound that the greatest degree of improvement in mutually incompatible – reasons).accuracy over rounds occurred in the ‘reasons’condition. Furthermore, we found that, although 7.2. The nature of panellistssubjects were less inclined to change their forecastswhen receiving ‘reasons’ feedback than the other A number of studies have considered the role oftypes, when they did change their forecasts they Delphi panellists and how their attributes relate totended to become more accurate – which was not the criteria such as the effectiveness (e.g., accuracy) ofcase in the ‘iteration’ and ‘statistical’ conditions. the procedure. One of the main attributes of panel- That feedback of a more profound nature can lists is their ‘expertise’ or ‘knowledgeability’. Asimprove the performance of Delphi groups has also discussed earlier, Delphi is intended for use bybeen shown by Best (1974). He found that, for one disparate experts, and yet most empirical studiesof two task items, a Delphi group that was given have used inexpert (often student) panels. Intuitively,feedback of reasons in addition to a median and the use of experts makes sense, but what doesrange of estimates was signiﬁcantly more accurate research show? A number of ‘process’ studies havethan a Delphi group that was simply provided with considered self-rated expertise, essentially to deter-the latter information. Gowan and McNichols (1993) mine whether self-ratings might be useful for select-considered the relative inﬂuences of three types of ing panels. Perhaps unsurprisingly – given theDelphi feedback: statistical, regression model, and number of ways in which self-ratings may be taken –if-then rules. They could not measure the accuracy of results have been equivocal, with a number ofjudgments (the task involved considering the econ- studies suggesting that self-assessment is a validomic viability of companies on the basis of ﬁnancial procedure (e.g., Dalkey et al., 1970; Best, 1974;ratios), but they did ﬁnd that (if-then) rule feedback Rowe & Wright, 1996), and others suggesting that itinduced a signiﬁcantly greater degree of ‘consensus’ ´ is not (Brockhoff, 1975; Larreche & Moinpour,than the other types. 1983). But these results say more about the utility of That different feedback types are differentially a particular expertise measure than the role ofrelated to factors such as accuracy, change, and experts in Delphi.consensus should be of no surprise. Indeed, in the Jolson and Rossow (1971), however, found thatﬁeld of social psychology there has been much accuracy increased over rounds for expert groups butresearch on judgment, conformity, and opinion not for inexpert ones (though they used only twochange in interacting groups, and how these relate to groups making judgments on only two problems).the type of information exchanged (e.g., see Deutsch Similarly, Riggs (1983) found that panels were more& Gerard, 1955; Myers, 1978; Isenberg, 1986), but accurate in predicting the result of a college footballsuch concerns seem not to have inﬁltrated the Delphi game on which they had more information (i.e., weredomain. Although the use of ‘reasons’ in feedback, more ‘expert’) than one on which they had less
372 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375(although a more interesting comparison might have of quality is generally inappropriate (Armstrong,been of the relative improvement in the two cases). 1985, reviews studies on ‘conﬁdence’ more broadly).Rowe and Wright (1996) found that the most accur- A more interesting conceptualisation of conﬁdence isate Delphi panellists on ﬁrst rounds changed their as a potential predictor of panellists’ propensity toestimates less over rounds than those who were change their estimates in the face of feedback.initially less accurate (and hence who were, arguab- Scheibe et al. (1975) found a positive relationshiply, less ‘expert’). Importantly, this result appears to between these factors (conﬁdence and change), butsupport the Theory of Errors (see, for example, Rowe and Wright (1996) found no evidence for this. ´ ´Parente & Anderson-Parente, 1987), in which ac- We therefore have no consistent evidence that initialcuracy is improved over rounds as a consequence of conﬁdence explains judgment change over Delphithe panel experts ‘holding-out’, while the less-expert rounds.panellists ‘swing’ towards the group average. The Two studies have considered the impacts of autility of expertise has been reviewed elsewhere, number of personality factors on the Delphi process.with evidence suggesting that there is an interaction Taylor et al. (1990) found no relationship betweenbetween expertise and the nature of the task, so that four demographic factors (e.g., gender and education)expertise is only helpful up to a certain level for and the ‘effectiveness’ of Delphi, or whether panel-forecasting tasks, but of greater importance for lists dropped out. Mulgrave and Ducanis (1975)estimation tasks (e.g., Welty, 1974; Armstrong, considered panellist dogmatism, ﬁnding that the most1985). More controlled experiments are required to dogmatic panellists changed judgments the most overexamine how expertise interacts with aspects of the rounds – although the authors had no explanation forDelphi technique, and how it relates to accuracy this rather counter-intuitive result and reported noimprovement over rounds. statistical analysis. Panellist conﬁdence has been studied from a The impact of the number of panellists has beennumber of perspectives. One conceptualisation of considered by Brockhoff (1975) (who used groups ofconﬁdence is as an outcome measure. For example, ﬁve, seven, nine, and 11) and Boje and MurnighanSniezek (1992) has pointed out that panel conﬁdence (1982) (using groups of three, seven, and 11).may be the only available measure of the quality of a Neither of these studies found a consistent relation-decision (e.g., since one cannot determine the accura- ship between panel size and effectiveness criteria.cy of forecasts a priori), and from this sense it isimportant that ‘conﬁdence’ in some way correlates toother quality measures. Although Rowe and Wright 8. Conclusion(1996) found that both conﬁdence and accuracyincreased over rounds, they found no clear relation- This paper reviews research conducted on theship between the accuracy and conﬁdence of the Delphi technique. In general, accuracy tends toindividual panellists. Conversely, Boje and Mur- increase over Delphi rounds, and hence tends to benighan (1982) found that while conﬁdence increased greater than in comparative staticized groups, whileover rounds, accuracy decreased. Sniezek has also Delphi panels also tend to be more accurate thancompared panellist conﬁdence to accuracy with unstructured interacting groups. The technique hascontradictory results (evidence for a positive rela- shown no clear advantages over other structuredtionship in Sniezek, 1989; but no evidence for such a procedures.relationship in Sniezek, 1990). Dietz (1987) attempt- Various difﬁculties exist in research of this tech-ed to weight panellist estimates according to how nique-comparison type, however. Our main concernconﬁdent they were, but found that such a weighting is with the sheer variety of technique formats thatprocess gave less accurate results than a standard have been used as representative of Delphi, varyingequal-weighting one. from the technique ‘ideal’ (and from each other) on From these inconsistent results in Delphi studies, aspects such as the type of feedback used and thewe conclude that the use of conﬁdence as a measure nature of the panellists. If one uses a technique
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 373format that varies from the ‘ideal’ on a factor that is Brockhoff, K. (1975). The performance of forecasting groups in computer dialogue and face to face discussions. In: Linstone,shown to inﬂuence the performance of the technique, H., & Turoff, M. (Eds.), The Delphi method: techniques andthen one is essentially studying a different technique. applications, Addison-Wesley, London.One tentative conclusion from this is that the rec- Brockhoff, K. (1984). Forecasting quality and information. Jour-ommended manner of comprising Delphi groups (as nal of Forecasting 3, 417–428.commonly used in real-world applications, e.g. Mar- Cohen, J. (1994). The Earth is round ( p , .05). Americantino, 1983) may well lead to greater enhancement of Psychologist December, 997–1003. Dagenais, F. (1978). The reliability and convergence of the Delphiaccuracy / quality than might be expected to arise technique. The Journal of General Psychology 98, 307–308.from the typical laboratory panel. In this case, it is Dalkey, N. C., Brown, B., & Cochran, S. W. (1970). The Delphipossible that potential exists for Delphi (conducted method III: use of self-ratings to improve group estimates.properly) to produce results far superior to those that Technological Forecasting 1, 283–291.have been demonstrated by research. Dalkey, N. C., & Helmer, O. (1963). An experimental application of the Delphi method to the use of experts. Management Indeed, the critique of technique-comparison Science 9, 458–467.studies is arguably pertinent to wider research Deutsch, M., & Gerard, H. B. (1955). A study of normative andconcerned with determining which is the best of the informational social inﬂuences upon individual judgment.various potential judgment-enhancing techniques. We Journal of Abnormal and Social Psychology 51, 629–636.believe that the focus of research needs to shift from Dietz, T. (1987). Methods for analyzing data from Delphi panels: some evidence from a forecasting study. Technological Fore-comparisons of vague techniques, to studies of the casting and Social Change 31, 79–85.processes related to judgment change and accuracy Einhorn, H. J., Hogarth, R. M., & Klempner, E. (1977). Quality ofwithin groups. We suggest there should be more group judgment. Psychological Bulletin 84, 158–172.research on the role of feedback in Delphi, and how Erffmeyer, R. C., Erffmeyer, E. S., & Lane, I. M. (1986). Theaspects of the task, the measures, and the panellists Delphi technique: an empirical evaluation of the optimalinteract to determine how ﬁrst round Delphi groups number of rounds. Group and Organization Studies 11 (1), 120–128.are transformed to ﬁnal round groups. We need to Erffmeyer, R. C., & Lane, I. M. (1984). Quality and acceptance ofunderstand the underlying processes of techniques an evaluative task: the effects of four group decision-makingbefore we can hope to determine their contingent formats. Group and Organization Studies 9 (4), 509–529.utilities. Felsenthal, D. S., & Fuchs, E. (1976). Experimental evaluation of ﬁve designs of redundant organizational systems. Administra- tive Science Quarterly 21, 474–488. Fischer, G. W. (1981). When oracles fail – a comparison of fourReferences procedures for aggregating subjective probability forecasts. Organizational Behavior and Human Performance 28, 96–110.Armstrong, J. S. (1985). Long range forecasting: from crystal ball Gowan, J. A., & McNichols, C. W. (1993). The efforts of to computer, 2nd ed., Wiley, New York. alternative forms of knowledge representation on decisionArmstrong, J. S., & Lusk, E. J. (1987). Return postage in mail making consensus. International Journal of Man–Machine surveys: a meta-analysis. Public Opinion Quarterly 51 (2), Studies 38, 489–507. 233–248. Gustafson, D. H., Shukla, R. K., Delbecq, A., & Walster, G. W.Ashton, R. H. (1986). Combining the judgments of experts: how (1973). A comparison study of differences in subjective many and which ones? Organizational Behaviour and Human likelihood estimates made by individuals, interacting groups, Decision Processes 38, 405–414. Delphi groups and nominal groups. Organizational BehaviorBardecki, M. J. (1984). Participants’ response to the Delphi and Human Performance 9, 280–291. method: an attitudinal perspective. Technological Forecasting Helmer, O. (1975). Foreward. In: Linstone, H., & Turoff, M. and Social Change 25, 281–292. (Eds.), The Delphi method: techniques and applications, Ad-Best, R. J. (1974). An experiment in Delphi estimation in dison-Wesley, London. marketing decision making. Journal of Marketing Research 11, Hill, K. Q., & Fowles, J. (1975). The methodological worth of the 448–452. Delphi forecasting technique. Technological Forecasting andBolger, F., & Wright, G. (1994). Assessing the quality of expert Social Change 7, 179–192. judgment: issues and analysis. Decision Support Systems 11, Hogarth, R. M. (1978). A note on aggregating opinions. Organi- 1–24. zational Behaviour and Human Performance 21, 40–46.Boje, D. M., & Murnighan, J. K. (1982). Group conﬁdence Hornsby, J. S., Smith, B. N., & Gupta, J. N. D. (1994). The pressures in iterative decisions. Management Science 28, impact of decision-making methodology on job evaluation 1187–1196. outcomes. Group and Organization Management 19, 112–128.
374 G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375Hudak, R. P., Brooke, P. P., Finstuen, K., & Riley, P. (1993). ´ ´ Parente, F. J., & Anderson-Parente, J. K. (1987). Delphi inquiry Health care administration in the year 2000: practitioners’ systems. In: Wright, G., & Ayton, P. (Eds.), Judgmental views of future issues and job requirements. Hospital and forecasting, Wiley, Chichester. Health Services Administration 38 (2), 181–195. Riggs, W. E. (1983). The Delphi method: an experimentalIsenberg, D. J. (1986). Group polarization: a critical review and evaluation. Technological Forecasting and Social Change 23, meta-analysis. Journal of Personality and Social Psychology 89–94. 50, 1141–1151. Rohrbaugh, J. (1979). Improving the quality of group judgment:Jolson, M. A., & Rossow, G. (1971). The Delphi process in social judgment analysis and the Delphi technique. Organiza- marketing decision making. Journal of Marketing Research 8, tional Behavior and Human Performance 24, 73–92. 443–448. Rosenthal, R. (1978). Combining results of independent studies.Kastein, M. R., Jacobs, M., Van der Hell, R. H., Luttik, K., & Psychological Bulletin 85 (1), 185–193. Touw-Otten, F. W. M. M. (1993). Delphi, the issue of Rowe, G., & Wright, G. (1996). The impact of task characteristics reliability: a qualitative Delphi study in primary health care in on the performance of structured group forecasting techniques. the Netherlands. Technological Forecasting and Social Change International Journal of Forecasting 12, 73–89. 44, 315–323. Rowe, G., Wright, G., & Bolger, F. (1991). The Delphi technique: ´Larreche, J. C., & Moinpour, R. (1983). Managerial judgment in a re-evaluation of research and theory. Technological Forecast- marketing: the concept of expertise. Journal of Marketing ing and Social Change 39 (3), 235–251. Research 20, 110–121. Sackman, H. (1975). Delphi critique, Lexington Books, Lex-Leape, L. L., Freshour, M. A., Yntema, D., & Hsiao, W. (1992). ington, MA. Small group judgment methods for determining resource based Saito, M., & Sinha, K. (1991). Delphi study on bridge condition relative values. Medical Care 30, NS28–NS39. rating and effects of improvements. Journal of TransportLinstone, H. A. (1975). Eight basic pitfalls: a checklist. In: Engineering 117, 320–334. Linstone, H., & Turoff, M. (Eds.), The Delphi method: Scheibe, M., Skutsch, M., & Schofer, J. (1975). Experiments in techniques and applications, Addison-Wesley, London. Delphi methodology. In: Linstone, H., & Turoff, M. (Eds.),Linstone, H. A. (1978). The Delphi technique. In: Fowles, J. The Delphi method: techniques and applications, Addison- (Ed.), Handbook of futures research, Greenwood Press, Wes- Wesley, London. tport, CT. Sniezek, J. A. (1989). An examination of group process inLinstone, H. A., & Turoff, M. (1975). The Delphi method: judgmental forecasting. International Journal of Forecasting 5, techniques and applications, Addison-Wesley, London. 171–178.Lock, A. (1987). Integrating group judgments in subjective Sniezek, J. A. (1990). A comparison of techniques for judgmental forecasts. In: Wright, G., & Ayton, P. (Eds.), Judgmental forecasting by groups with common information. Group and forecasting, Wiley, Chichester. Organization Studies 15 (1), 5–19.Lunsford, D. A., & Fussell, B. C. (1993). Marketing business Sniezek, J. A. (1992). Groups under uncertainty: an examination services in central Europe. Journal of Services Marketing 7 (1), of conﬁdence in group decision making. Organizational Be- 13–21. havior and Human Decision Processes 52 (1), 124–155.Martino, J. (1983). Technological forecasting for decision making, Spinelli, T. (1983). The Delphi decision-making process. Journal 2nd ed., American Elsevier, New York. of Psychology 113, 73–80.Miner, F. C. (1979). A comparative analysis of three diverse Stewart, T. R. (1987). The Delphi technique and judgmental group decision making approaches. Academy of Management forecasting. Climatic Change 11, 97–113. Journal 22 (1), 81–93. Taylor, R. G., Pease, J., & Reid, W. M. (1990). A study ofMulgrave, N. W., & Ducanis, A. J. (1975). Propensity to change survivability and abandonment of contributions in a chain of responses in a Delphi round as a function of dogmatism. In: Delphi rounds. Psychology: A Journal of Human Behavior 27, Linstone, H., & Turoff, M. (Eds.), The Delphi method: 1–6. techniques and applications, Addison-Wesley, London. Van de Ven, A. H., & Delbecq, A. L. (1974). The effectiveness ofMyers, D. G. (1978). Polarizing effects of social comparisons. nominal, Delphi, and interacting group decision making pro- Journal of Experimental Social Psychology 14, 554–563. cesses. Academy of Management Journal 17 (4), 605–621.Neiderman, F., Brancheau, J. C., & Wetherbe, J. C. (1991). Van Dijk, J. A. G. M. (1990). Delphi questionnaires versus Information systems management issues for the 1990s. MIS individual and group interviews: a comparison case. Tech- Quarterly 15 (4), 474–500. nological Forecasting and Social Change 37, 293–304.Olshfski, D., & Joseph, A. (1991). Assessing training needs of Welty, G. (1974). The necessity, sufﬁciency and desirability of executives using the Delphi technique. Public Productivity and experts as value forecasters. In: Leinfellner, W., & Kohler, E. Management Review 14 (3), 297–301. (Eds.), Developments in the methodology of social science,Ono, R., & Wedermeyer, D. J. (1994). Assessing the validity of Reidel, Boston. the Delphi technique. Futures 26, 289–304. Wright, G., Lawrence, M. J., & Collopy, F. (1996). The role and ´Parente, F. J., Anderson, J. K., Myers, P., & O’Brien, T. (1984). validity of judgment in forecasting. International Journal of An examination of factors contributing to Delphi accuracy. Forecasting 12, 1–8. Journal of Forecasting 3 (2), 173–182.
G. Rowe, G. Wright / International Journal of Forecasting 15 (1999) 353 – 375 375Biographies: Gene ROWE is an experimental psychologist who George WRIGHT is a psychologist with an interest in thegained his PhD from the Bristol Business School at the University judgmental aspects of forecasting and decision making. He isof the West of England (UWE). After some years at UWE, then at Editor of the Journal of Behavioral Decision Making and anthe University of Surrey, he is now at The Institute of Food Associate Editor of the International Journal of Forecasting andResearch (IFR), Norwich. His research interests have ranged from the Journal of Forecasting. He has published in such journals asexpert systems and group decision support, to judgment and Management Science, Current Anthropology, Journal of Directdecision making more generally. Lately, he has been involved in Marketing, Memory and Cognition, and the International Journalresearch on risk perception and public participation mechanisms in of Information Management.risk assessment and management.