The Hamilton Depression Rating Scale: Has the
Gold Standard Become a Lead Weight?
ARTICLE in AMERICAN JOURNAL OF
PSYCHIATRY · JANUARY 2005
Impact Factor: 12.3 · DOI:
10.1176/appi.ajp.161.12.2163 · Source: PubMed
R. Michael Bagby, University of Toronto
Andrew G. Ryder, Concordia University, Montreal
Am J Psychiatry 161:12, December 2004 2163
Reviews and Overviews
http://ajp.psychiatryonline.org
The Hamilton Depression Rating Scale:
Has the Gold Standard Become a Lead Weight?
R. Michael Bagby, Ph.D.
Andrew G. Ryder, M.A.
Deborah R. Schuller, M.D.
Margarita B. Marshall, B.Sc.
Objective: The Hamilton Depression Rating Scale has been the gold standard for the assessment of depression for more than 40 years. Criticism of the instrument has been increasing. The authors review studies published since the last major review of this instrument in 1979 that explicitly examine the psychometric properties of the Hamilton depression scale. The authors’ goal is to determine whether continued use of the Hamilton depression scale as a measure of treatment outcome is justified.
Method: MEDLINE was searched for studies published since 1979 that examine psychometric properties of the Hamilton depression scale. Seventy studies were identified and selected, and then grouped into three categories on the basis of the major psychometric properties examined: reliability, item-response characteristics, and validity.
Results: The Hamilton depression scale’s internal reliability is adequate, but many scale items are poor contributors to the measurement of depression severity; others have poor interrater and retest reliability. For many items, the format for response options is not optimal. Content validity is poor; convergent validity and discriminant validity are adequate. The factor structure of the Hamilton depression scale is multidimensional but with poor replication across samples.
Conclusions: Evidence suggests that the Hamilton depression scale is psychometrically and conceptually flawed. The breadth and severity of the problems militate against efforts to revise the current instrument. After more than 40 years, it is time to embrace a new gold standard for assessment of depression.
(Am J Psychiatry 2004; 161:2163–2177)
The Hamilton Depression Rating Scale (1) was developed in the late 1950s to assess the effectiveness of the first generation of antidepressants and was originally published in 1960. Although Hamilton (1) recognized that the scale had “room for improvement” (p. 56) and that further revision was necessary, the scale quickly became the standard measure of depression severity for clinical trials of antidepressants (2, 3). The Hamilton depression scale has retained this function and is now the most commonly used measure of depression (3). Our objective in this article is to provide a review of the Hamilton depression scale literature published since the last major evaluation of its psychometric properties, more than 20 years ago (4). More recent reviews have appeared (3, 5–7), but they have not systematically examined the literature with regard to a broad range of measurement issues. Significant developments in psychometric theory and practice have been made since the 1950s and need to be applied to instruments currently in use. We evaluate the Hamilton depression scale in light of these current standards and conclude by presenting arguments for and against retaining, revising, or rejecting the Hamilton depression scale as the gold standard for assessment of depression.
Method
Studies for the review were identified by means of MEDLINE searches for both “depression” and “Hamilton.” All studies published during the period since the last major review (January 1980 to May 2003) were considered. Studies selected for review had to be explicitly designed to evaluate empirically the psychometric properties of the instrument or to review conceptual issues related to the instrument’s development, continued use, and/or shortcomings. At least 20 published versions of the Hamilton depression scale exist, including both longer and shortened versions. This review was limited to studies that examined the original 17-item version, as the majority of the studies that evaluated the scale’s psychometrics used the 17-item version. Only a small number of studies evaluated other versions, and most of these versions contain the original 17 items. Seventy articles met the selection criteria and were categorized into three groups on the basis of the major psychometric property examined: reliability, item response, and validity. Table 1 lists the articles included in the review.
Results
Reliability
Clinician-rated instruments should demonstrate three
types of reliability: 1) internal reliability, 2) retest reliability,
and 3) interrater reliability. Cronbach’s alpha statistic (78)
is used to evaluate internal reliability, and estimates ≥0.70

HAMILTON DEPRESSION SCALE
TABLE 1. Characteristics of Studies Examining the Psychometric Properties of the Hamilton Depression Rating Scale^a
Columns: Study; Year; Language; N; % of Female Subjects; Subjects; Psychometric Properties Examined (Reliability, Item Response, Validity)
Aben et al. (8) 2002 Dutch 202 46 Stroke patients ×
Addington et al. (9) 1990 English 250 —b Schizophrenia
inpatients ×
Addington et al. (10) 1996 English 112c 60 Schizophrenia
inpatients × ×
Addington et al. (10) 1996 English 89d —b Schizophrenia
inpatients × ×
Akdemir et al. (11) 2001 Turkish 94 66 Psychiatric patients × ×
Baca-García et al. (12) 2001 Spanish 1 100 Dysthymia
outpatient ×
Bech (5) 1981 Danish 66 70 Depressed inpatients × ×
Bech et al. (13) 1992 Multilingual 1,128 —b Psychiatric
patients × ×
Bech et al. (14) 2002 Danish 650 —b Psychiatric patients × ×
Berard and Ahmed (15) 1995 English 22 64 Elderly psychiatric
outpatients × ×

Berrios and Bulbena-
Villarasa (16)
1990 Castilian 1,204 59 Psychiatric outpatients × ×
Brown et al. (17) 1995 English 259 —b Medical outpatients ×
Carroll et al. (18) 1981 English 278 —b Depressed patients ×
Cicchetti and Prusoff (19) 1983
Time 1 English 86 —b Depressed outpatients ×
Time 2 English 81 —b Depressed outpatients ×
Craig et al. (20) 1985 English 32 0 Schizophrenia inpatients × ×
Daradkeh et al. (21) 1997 Arabic 73 58 Depressed inpatients ×
×
Deluty et al. (22) 1986 English 70 39 Psychiatric inpatients × ×
Demitrack et al. (23) 1998 —b 85 66 Professionals/laypersons ×
Entsuah et al. (24) 2002
Sample 1 Multilingual 865 65 Psychiatric patients ×
Faries et al. (25) 2000 —b 1,658 —b Depressed outpatients ×
Feinberg et al. (26) 1981 English —b —b Depressed patients ×
Fleck et al. (27) 1995 French 60 77 Psychiatric outpatients ×
Fuglum et al. (28) 1996 Danish —b —b Depressed patients × ×
Gastpar and Gilsdorf (29) 1990 Multilingual 122 66 Depressed
patients ×
Gibbons et al. (30) 1993 English 370 72 Psychiatric patients × ×
Gilley et al. (31) 1995
Sample 1 English 185 56 Alzheimer’s disease patients × ×
Sample 2 English 54 39 Comparison subjects with normal cognition × ×
Sample 3 English 57 37 Parkinson’s disease patients × ×
Gottlieb et al. (32) 1988 English 43 67 Neurological patients ×
×
Gullion and Rush (33) 1998 English 324 67 Depressed patients
×
Hammond (34) 1998 English 100 74 Elderly medical patients ×
Hooijer et al. (35) 1991 Flemish 56 —b Mental health
professionals ×
Hotopf et al. (36) 1998 English 49 65 Primary care patients ×
Kobak et al. (37) 1999 English 113 —b Psychiatric
patients/community
comparison subjects
× ×
Koenig et al. (38) 1995 English 38 55 Elderly medical patients
×
Lambert et al. (39) 1986 —b 1,850 —b Psychiatric patients ×
Lambert et al. (40) 1988 English 13 31 Psychiatric
inpatients/outpatients ×
Leentjens et al. (41) 2000 Dutch 63 37 Parkinson’s disease
patients ×
Leung et al. (42) 1999 Chinese 93 56 Psychiatric inpatients × ×
McAdams et al. (43) 1996 English 101 23 Schizophrenia
outpatients ×
Maier and Philipp (44) 1985 German 280 —b Psychiatric
outpatients ×
Maier et al. (45) 1988
Sample 1 German 130 —b Psychiatric inpatients × × ×
Sample 2 German 48 —b Psychiatric inpatients × × ×
Maier et al. (46) 1988 German 130 —b Psychiatric inpatients ×
Marcos and Salamero (47) 1990 Spanish 234 76 Community

geriatric subjects ×
Meyer et al. (48) 2001 English 196 68 Medical outpatients ×
Middelboe et al. (49) 1994 Danish 36 64 Medical outpatients ×
Moberg et al. (50) 2001 English 20 70 Geriatric
consultation/liaison patients ×
Mottram et al. (51) 2000 English 433 73 Elderly psychiatric
referrals ×
Naarding et al. (52) 2002
Sample 1 Dutch 44 36 Stroke inpatients ×
Sample 2 Dutch 274 60 Alzheimer’s disease patients ×
Sample 3 Dutch 85 40 Parkinson’s disease patients ×
O’Brien and Glaudin (53) 1988
Sample 1 English 183 70 Psychiatric outpatients ×
Sample 2 English 182 70 Psychiatric outpatients ×
(continued)
BAGBY, RYDER, SCHULLER, ET AL.
reflect adequate reliability (79, 80). The internal reliability of individual items is calculated by using corrected item-to-total correlation with Pearson’s r; items should have a correlation greater than 0.20 (79, 80). Retest reliability assesses the extent to which multiple administrations of the scale generate the same results. When scores on an instrument are expected to change in response to effective treatment, it is necessary to demonstrate that these scores remain the same in the absence of treatment. Interrater reliability assesses the extent to which multiple raters generate the same result. Although Pearson’s r is often used to compute these estimates, the preferred method is the intraclass r (81), which allows for adjustment for agreement by chance. Estimates of retest and interrater reliability should be at a minimum of 0.70 (Pearson’s r) and 0.60 (intraclass r) (82). For retest reliability of scale items, Pearson’s r >0.70 is considered acceptable (83).
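The two internal reliability statistics named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings matrix is invented (six hypothetical patients, four hypothetical items), the function names are ours, and the thresholds (alpha ≥0.70, corrected item-to-total r >0.20) are the ones quoted in the text.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(rows):
    """Cronbach's alpha; rows = one list of item scores per patient."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(rows, j):
    """Correlate item j with the sum of the remaining items (item j excluded)."""
    item = [r[j] for r in rows]
    rest = [sum(r) - r[j] for r in rows]
    return pearson_r(item, rest)


# Hypothetical ratings, invented for illustration only (not data from any study).
ratings = [
    [2, 1, 2, 0],
    [3, 2, 2, 1],
    [1, 1, 0, 2],
    [4, 3, 3, 0],
    [0, 0, 1, 1],
    [2, 2, 1, 2],
]

print("alpha:", round(cronbach_alpha(ratings), 2))
for j in range(len(ratings[0])):
    r = corrected_item_total(ratings, j)
    print(f"item {j + 1}: r = {r:.2f}", "ok" if r > 0.20 else "below 0.20")
```

In this invented data set the fourth item runs against the others, so its corrected item-to-total correlation falls below the 0.20 criterion even though the remaining items hang together.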
Internal Reliability
Table 2 summarizes the results from studies examining internal reliability of the total Hamilton depression scale. Estimates ranged from 0.46 to 0.97, and 10 studies reported estimates ≥0.70. Table 3 summarizes the studies that examined internal reliability at the item level. The majority of Hamilton depression scale items show adequate reliability. Six items met the reliability criteria in every sample (guilt, middle insomnia, psychic anxiety, somatic anxiety, gastrointestinal, general somatic), and an additional six items met the criteria in all but one sample (depressed mood, suicide, early insomnia, late insomnia, work and interests, hypochondriasis). Loss of insight was the item with the most variable findings, suggesting a potential problem with this item.
Interrater Reliability
Total Hamilton depression scale interrater reliabilities
are displayed in Table 2. Pearson’s r ranged from 0.82 to
TABLE 1. Characteristics of Studies Examining the Psychometric Properties of the Hamilton Depression Rating Scale^a (continued)
Columns: Study; Year; Language; N; % of Female Subjects; Subjects; Psychometric Properties Examined (Reliability, Item Response, Validity)
O’Hara and Rehm (54) 1983 English 20 0 Depressed outpatients
×
Olsen et al. (55) 2003 Danish 91 74 Psychiatric and medical
patients ×
Onega and Abraham (56) 1997 English 206 70 Geriatric
psychiatric outpatients ×
Pancheri et al. (57) 2002 Italian 186 62 Depressed outpatients ×
×
Paykel (58) 1990
Sample 1 English 101 —b Depressed inpatients × ×
Sample 2 English 118 —b Psychiatric outpatients × ×
Sample 3 English 167 —b General practice outpatients × ×
Potts et al. (59) 1990 English 694 74 Depressed outpatients ×
Ramos-Brieva and
Cordero-Villafafila (60)
1988 Spanish 135 70 Depressed inpatients/outpatients × ×
Rehm and O’Hara (61) 1985 English 158 100 Community
(symptomatic) subjects × ×
Reynolds and Kobak (62) 1995 English 357 59 Psychiatric
outpatient/nonreferred
community subjects
×

Riskind et al. (63) 1987 English 191 54 Psychiatric outpatients
× ×
Santor and Coyne (64) 2001
Sample 1 English 316 —b Primary care outpatients ×
Sample 2 English 318 70 Depressed outpatients ×
Santor and Coyne (65) 2001 English 732 —b Depressed patients ×
Sayer et al. (66) 1993 English 114 61 Psychiatric inpatients × ×
Senra Rivera et al. (67) 2000 Castilian 52 65 Depressed patients
× ×
Shain et al. (68) 1990 English 45 64 Depressed adolescent
inpatients ×
Smouse et al. (69) 1981 English —b —b Depressed patients ×
Steinmeyer and Möller (70) 1992 German 223e 68 Psychiatric
inpatients ×
Steinmeyer and Möller (70) 1992 German 174f 68 Psychiatric
inpatients ×
Strik et al. (71) 2001
Sample 1 Dutch 156 0 Medical patients × ×
Sample 2 Dutch 50 100 Medical patients × ×
Teri and Wagner (72) 1991 English 75 68 Alzheimer’s patients
×
Thase et al. (73) 1983 English 147 100 Depressed outpatients ×
×
Thompson et al. (74) 1998 English 242 100 Psychiatric referrals
×
Whisman et al. (75) 1989 English 70 100 Depressed outpatients
× ×
Williams (76) 1988 English 23 65 Psychiatric inpatients ×
Zheng et al. (77) 1988 Chinese 329 47 Psychiatric
inpatients/outpatients × ×
a Studies were published between January 1980 and May 2003 and identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Not reported.
c Number of subjects providing data at time 1.
d Number of subjects providing follow-up data 3 months after admission.
e Number of subjects providing baseline (i.e., pretreatment) data.
f Number of subjects providing endpoint (week 6) data after treatment with either paroxetine or amitriptyline.
0.98, and the intraclass r ranged from 0.46 to 0.99. Some investigators provided evidence that the skill level or expertise of the interviewer and the provision of structured queries and scoring guidelines affect reliability (19, 23, 35, 54). Across studies, the mean interrater reliability estimate for studies reporting higher levels of interviewer skill and use of expert raters, structured queries, and scoring guidelines did not statistically differ from that for other studies (z=0.81, n.s.).
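Why Pearson’s r and the intraclass r can diverge for the same pair of raters can be shown with a small sketch. The assumptions are ours: the total scores are invented, and the intraclass coefficient is computed in its one-way random-effects form, which is only one of several variants and is not necessarily the form used in the reviewed studies.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_oneway(ratings):
    """One-way random-effects intraclass correlation for single ratings.

    ratings: one list per patient, holding one score per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical total scores: rater B always scores 4 points higher than rater A.
rater_a = [10, 14, 18, 22, 26]
rater_b = [score + 4 for score in rater_a]

r = pearson_r(rater_a, rater_b)
icc = icc_oneway([list(pair) for pair in zip(rater_a, rater_b)])
print(f"Pearson r = {r:.2f}, intraclass r = {icc:.2f}")
```

Because the two raters rank the patients identically, Pearson’s r is perfect, while the intraclass coefficient is pulled down by the constant 4-point disagreement in the actual scores.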
At the individual item level, interrater reliability is poor for many items. Cicchetti and Prusoff (19) assessed reliability before treatment initiation and 16 weeks later at trial end. Only early insomnia was adequately reliable before treatment, and only depressed mood was adequately reliable after treatment. Thirteen items had coefficients <0.50 before treatment, and 11 items had coefficients <0.50 after treatment. Rehm and O’Hara (61) performed a similar analysis with data from two samples. Six items showed adequate reliability in the first sample (early insomnia, middle insomnia, late insomnia, somatic anxiety, gastrointestinal, loss of libido), as did 10 in the second sample (depressed mood, guilt, suicide, early insomnia, middle insomnia, late insomnia, work/interests, psychic anxiety, somatic anxiety, gastrointestinal). Loss of insight showed the lowest interrater agreement in both samples. Craig et al. (20) found that only one item, work/interests, had adequate interrater reliability. Moberg et al. (50) reported that nine items demonstrated adequate reliability when the standard Hamilton depression scale was administered (depressed mood, guilt, suicide, early insomnia, late insomnia, agitation, psychic anxiety, hypochondriasis, loss of insight), but all items showed adequate reliability when the scale was administered with interview guidelines. Potts et al. (59) demonstrated that a single omnibus coefficient can mask specific problems. Using a structured interview version of the Hamilton depression scale, they
TABLE 2. Studies Reporting Reliability Estimates for the Total 17-Item Hamilton Depression Rating Scale^a
Columns: Study; Year; Internal Reliability (Cronbach’s alpha); Interrater Reliability (Pearson’s r); Interrater Reliability (Intraclass r); Retest Reliability (Pearson’s r)
Addington et al. (9) 1990 0.82
Addington et al. (10) 1996 0.93
Akdemir et al. (11) 2001 0.75 0.87–0.98b 0.85
Baca-García et al. (12) 2001 0.97
Time 1 0.46
Time 2 0.82
Craig et al. (20) 1985 0.95
Deluty et al. (22) 1986 0.96
Demitrack et al. (23) 1998 0.65–0.79b
Fuglum et al. (28) 1996 0.86 0.81
Gastpar and Gilsdorf (29) 1990 0.48
Gilley et al. sample 1 (31) 1995 0.92
Gottlieb et al. (32) 1988 0.99
Hammond (34) 1998 0.46
Kobak et al. (37) 1999 0.91 0.98
Koenig et al. (38) 1995 0.97
Leung et al. (42) 1999 0.94
Maier et al. (45) 1988
Sample 1 0.70
Sample 2
Time 1 0.72
Time 2 0.70
McAdams et al. (43) 1996 0.77
Meyer et al. (48) 2001 0.57–0.80b
Middelboe et al. (49) 1994 0.75
O’Hara and Rehm (54) 1983

Expert raters 0.91
Novice raters 0.76
Pancheri et al. (57) 2002 0.90
Potts et al. (59) 1990 0.82 0.92
Ramos-Brieva and Cordero-Villafafila (60) 1988 0.72
Rehm and O’Hara (61) 1985
Study 1 0.76 0.78–0.91b
Study 2 0.91–0.96b
Reynolds and Kobak (62) 1995 0.92 0.96
Riskind et al. (63) 1987 0.73
Shain et al. (68) 1990 0.97
Teri and Wagner (72) 1991 0.65–0.97b
Whisman et al. (75) 1989 0.85
Williams (76) 1988 0.81
Zheng et al. (77) 1988 0.71 0.92
a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Range over multiple pairs of raters.
found an overall intraclass coefficient of 0.92; however, two trained psychiatrists differed at least 20% of the time in their ratings of psychic anxiety, psychomotor agitation, and psychomotor retardation, and they differed by at least two points 15% of the time in their ratings of loss of libido. The ratings of trained raters disagreed with the psychiatrists’ ratings on psychomotor agitation (50% of the time), hypochondriasis (60%), loss of libido (90%), and loss of energy (100%).
Retest Reliability
Retest reliability for the Hamilton depression scale ranged from 0.81 to 0.98 (Table 2). Retest reliability at the item level (Table 3) ranged from 0.00 to 0.85. Williams (76) argued in favor of using structured interview guides to boost item and total scale reliability and developed the Structured Interview Guide for the Hamilton Depression Rating Scale. This effort increased the mean retest reliability across individual items to 0.54, although only four items met the criteria for adequate reliability (depressed mood, early insomnia, psychic anxiety, and loss of libido).
Item Characteristics
Content and scaling. Standard psychometric practice dictates that items within an instrument should measure a single symptom and contain response options linked to increasing or decreasing amounts of that symptom. Each item is assumed to contribute equally to the total score or be backed with evidence in support of differential weighting. These criteria are not consistently met by using the current scaling procedure or the options for rating symptoms. Although improperly scaled items can cause problems in quantitative measurement, evaluation of item scaling takes place first at a qualitative level. Some Hamilton depression scale items measure single symptoms along a meaningful continuum of severity; many do not. The item assessing depressed mood includes a combination of affective, behavioral, and cognitive features, such as gloomy attitude, pessimism about the future, subjective feeling of sadness, and tendency to weep. The general somatic symptoms item, which is also symptomatically heterogeneous, includes feelings of heaviness, diffuse backache, and loss of energy. Headache is coded only as part of somatic anxiety along with such symptoms as indigestion, palpitations, and respiratory difficulties. Genital symptoms for women entail loss of libido and menstrual disturbances. The problems inherent in the heterogeneity of these rating descriptors reduce the potential meaningfulness of these items, a problem exacerbated if the different components of an item actually measure multiple constructs and thus measure different effects.
Most items on the Hamilton depression scale at least are scaled so that increasing scores represent increasing severity. It is less clear whether the anchors used for different scores on certain items actually assess the same underlying construct/syndrome. This ambiguity is most obvious for severity ratings involving psychotic features. The feelings of guilt item, for example, is graded as follows: 0=absent, 1=self-reproach, 2=ideas of guilt or rumination over past errors or sinful deeds, 3=present illness is a punishment, and 4=hears accusatory or denunciatory voices and/or experiences threatening visual hallucinations. A patient with guilt-themed hallucinations may be more severely ill than a patient who has nonpsychotic guilty feelings, but is he/she feeling more guilt? The psychotic features may instead represent a qualitatively different construct/syndrome associated with more severe illness. Similarly, the hypochondriasis item progresses through bodily self-absorption (rated 1) and preoccupation with health (rated 2) before switching to querulous attitude (rated 3) and then again to hypochondriacal delusions (rated 4). These item-scoring anchors violate basic measurement principles, because nominal scaling and ordinal scaling are combined in a single item.
Although Hamilton (1) explained the rationale for the inclusion of both 3-point and 5-point items, the argument was not made on the grounds of differential weighting. Hamilton believed that certain items would be difficult to anchor dimensionally and therefore assigned them fewer response options. The end result is that certain items contribute more to the total score than others. Contrasting psychomotor retardation and psychomotor agitation, for example, reveals that a severe manifestation of the former contributes 4 points, whereas an equally severe manifestation of the latter contributes 2 points. Similarly, someone who weeps all the time can contribute 3 or 4 points on depressed mood, whereas someone who feels tired all the time can contribute only 2 points on the general somatic symptoms item.
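The unintended weighting described above is simple arithmetic, sketched below for the four items contrasted in the text. The maximum scores are the ones stated in the paragraph; the `equal_weight_points` rescaling is a hypothetical remedy added for illustration and is not something the scale or the review prescribes.

```python
# Maximum scores for the four items contrasted in the text (other items omitted).
MAX_SCORE = {
    "psychomotor retardation": 4,
    "psychomotor agitation": 2,
    "depressed mood": 4,
    "general somatic symptoms": 2,
}


def points(item, severity):
    """Points added to the total for a symptom at `severity` (0.0 to 1.0 of the item's range)."""
    return severity * MAX_SCORE[item]


def equal_weight_points(item, severity, common_max=4):
    """Hypothetical rescaling so that every item spans 0..common_max."""
    return points(item, severity) * common_max / MAX_SCORE[item]


for item in MAX_SCORE:
    print(
        f"{item}: most severe rating adds {points(item, 1.0):.0f} points; "
        f"rescaled, it would add {equal_weight_points(item, 1.0):.0f}"
    )
```

With the raw scoring, maximal retardation moves the total twice as far as maximal agitation, although nothing in the scale’s rationale says retardation should count double.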
Item Response Analysis
A psychiatric rating scale should measure a single psychopathological construct (i.e., an illness or syndrome) and be composed of items that adequately cover a range of symptoms that are consistently associated with the syndrome. Item response theory, a method used increasingly in the evaluation and construction of psychometric instruments, permits empirical evaluation of these premises. It is important to note that this method was not available when the original Hamilton depression scale was developed, although some researchers more recently used this method to evaluate this instrument. According to item response theory, a scale and its constituent items may have good reliability estimates but still fail to meet item response theory criteria. For example, if a depression scale were composed only of items measuring mild depression, the instrument would have great difficulty distinguishing between moderate and severe cases of depression, as both would be characterized by high scores on all items. This issue is particularly pressing in studies of clinical change; not only is a wide range of severity often represented in this research, but individual patients are expected to move along this continuum as they improve. Continued use of items insensitive to change underestimates the strength of actual treatment effects and makes it necessary to have larger samples to demonstrate that an effect is statistically significant. Falsely identifying patients as not having changed represents an additional source of “noise” and weakens the “signal” of a true treatment effect. A pragmatic implication of such lack of sensitivity is that new compounds shown to be promising in the laboratory may appear spuriously ineffective in clinical trials.
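The mild-items example in the preceding paragraph can be made concrete with a minimal item response sketch. Everything specific here is an assumption made for illustration: items are treated as yes/no (the real scale uses graded ratings), responses follow a two-parameter logistic model, and the difficulty values, the discrimination of 2.0, and the latent severities chosen for “moderate” and “severe” are invented.

```python
import math


def p_endorse(theta, difficulty, discrimination=2.0):
    """Two-parameter logistic model: probability that a symptom is rated present."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def expected_score(theta, difficulties):
    """Expected total score on a scale of yes/no items at latent severity theta."""
    return sum(p_endorse(theta, b) for b in difficulties)


# Hypothetical item difficulties (location on the latent severity continuum).
MILD_ONLY = [-2.0, -1.5, -1.0, -0.5]  # every item already "switches on" at mild severity
SPREAD = [-1.5, 0.0, 1.5, 3.0]        # items cover mild through severe

MODERATE, SEVERE = 1.0, 2.5           # hypothetical latent severities

gap_mild = expected_score(SEVERE, MILD_ONLY) - expected_score(MODERATE, MILD_ONLY)
gap_spread = expected_score(SEVERE, SPREAD) - expected_score(MODERATE, SPREAD)

print(f"mild-only scale: severe minus moderate = {gap_mild:.2f} points")
print(f"spread scale:    severe minus moderate = {gap_spread:.2f} points")
```

On the mild-only scale both patients score close to the maximum, so moving from severe to moderate barely changes the total; the scale with items spread along the continuum registers the same change much more clearly.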
A related issue concerns the extent to which a severity score actually measures a single unidimensional syndrome. To summarize a syndrome with a single score requires a precise understanding of what that score represents. The implicit assumption is that the severity score represents a single dimension (84); if depression is heterogeneous, interpretation of a single summed score is unclear. If, for example, items assessing psychological and physical symptoms were only loosely related, a single score would not distinguish between two potentially different groups of depressed patients: one group whose symptoms were primarily psychological and another group with primarily vegetative symptoms. Any effects of an intervention targeting only one of these aspects would be harder to detect.
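A toy example shows how a single summed score can hide the two patient groups described above. The patients and their ratings are invented, and the split of items into “psychological” and “physical” groups is our own assumption for illustration rather than a factor structure reported in the review.

```python
# Hypothetical grouping of a few scale items (an assumption for illustration).
PSYCHOLOGICAL = ("depressed mood", "guilt", "suicide")
PHYSICAL = ("early insomnia", "gastrointestinal", "general somatic")

# Two invented patients with the same total but different symptom profiles.
patient_a = {"depressed mood": 4, "guilt": 3, "suicide": 2,
             "early insomnia": 0, "gastrointestinal": 0, "general somatic": 0}
patient_b = {"depressed mood": 2, "guilt": 1, "suicide": 0,
             "early insomnia": 2, "gastrointestinal": 2, "general somatic": 2}


def total(patient):
    """Single summed severity score."""
    return sum(patient.values())


def subscore(patient, items):
    """Sum over one group of items only."""
    return sum(patient[item] for item in items)


def treat_physical_only(patient):
    """Hypothetical intervention that removes physical symptoms and nothing else."""
    return {item: (0 if item in PHYSICAL else score) for item, score in patient.items()}


for name, patient in (("A", patient_a), ("B", patient_b)):
    change = total(patient) - total(treat_physical_only(patient))
    print(f"patient {name}: total {total(patient)}, "
          f"psychological {subscore(patient, PSYCHOLOGICAL)}, "
          f"physical {subscore(patient, PHYSICAL)}, "
          f"drop after treatment {change}")
```

Both patients enter with the same total, yet the hypothetical treatment lowers the total for only one of them, so averaging the two dilutes an effect that is complete on the dimension it targets.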
Gibbons et al. (85) presented a strategy for identifying a unidimensional set of items from a psychiatric rating scale and evaluating the extent to which these items adequately measure the full range of depression severity. Subsequently, a subset of Hamilton depression scale items that would measure a single dimension of depression across a wide range of severity was developed (30). This subset included depressed mood, which was sensitive at low levels; work/interests, psychic anxiety, and loss of libido, which were sensitive at mild levels; somatic anxiety, psychomotor agitation, and guilt, which were sensitive at moderate levels; and suicide, which was sensitive at severe levels. These items were proposed as a psychometrically stronger form of the full Hamilton depression scale.
Santor and Coyne (64, 65) used item response theory to examine the functioning of the full Hamilton depression scale and its individual items. In one of these studies (65) they examined individual Hamilton depression scale item performance in a combined sample of primary care patients and depressed patients from the National Institute of Mental Health Treatment of Depression Collaborative Research Program. One expects different item ratings at
TABLE 3. Studies Reporting Item Reliability Estimates for the 17-Item Hamilton Depression Rating Scale^a

Reliability Measure and Study           Year  Depressed  Guilt      Suicide    Early      Middle     Late       Work/
                                              Mood                             Insomnia   Insomnia   Insomnia   Interests
Internal reliability^b
  Berrios and Bulbena-Villarasa (16)    1990
    Sample 1                                  0.32       0.24       0.26       0.25       0.32       0.31       0.39
    Sample 2                                  0.37       0.38       0.40       0.23       0.37       0.42       0.33
  Gastpar and Gilsdorf (29)             1990
    Time 1                                    0.10       0.22       –0.04      0.04       0.22       0.13       0.09
    Time 2                                    0.65       0.39       0.50       0.44       0.46       0.53       0.73
  Paykel (58)                           1990
    Sample 1                                  0.52       0.31       0.31       0.24       0.21       0.38       0.59
    Sample 2                                  0.42       0.38       0.47       0.27       0.34       0.30       0.58
    Sample 3                                  0.52       0.41       0.49       0.34       0.35       0.34       0.59
  Rehm and O’Hara (61)                  1985  0.63       0.26       0.47       0.40       0.41       0.37       0.46
Interrater reliability^c
    Time 1                                    0.37       0.18       0.59       0.76       0.57       0.42       0.33
    Time 2                                    0.72       0.37       0.64       0.57       0.45       0.49       0.64
  Moberg et al. (50)^d                  2001
    Standard administration                   0.90       0.80       0.90       0.61       0.39       0.89       0.50
    Interview guidelines                      0.96       0.83       0.81       0.97       0.78       0.89       0.87
  Rehm and O’Hara (61)^e                1985
    Above median split                        0.61       0.39       0.49       0.74       0.79       0.72       0.56
    Below median split                        0.84       0.82       0.92       0.91       0.79       0.92       0.73
Retest reliability^f
  Akdemir et al. (11)                   2001  0.61       0.78       0.67       0.69       0.79       0.76       0.73
  Williams (76)                         1988  0.80       0.63       0.64       0.80       0.62       0.30       0.54

a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Correlation of item scores with total scores. An uncorrected Pearson’s r>0.20 was considered significant. Significant correlations are shown in boldface type.
OK Following on from the previous. Short podcast This is
another one looking at scales for depression so another famous
scale that's used for measuring depression has been for a long
time Hamilton depression rating scale factors for one so. Yes it
uses. To evaluate. And to depression medications still. So let's
take a look at this article this is this is another article. That has
a good foundation study in this case of the Hamilton scale. So.
Again your. Introduction here. Around one has been the gold

standard for the assessment of depression for 2 years when this
was published which of us was. 15. And 50 years. And. Life the
infected person in the entry This was also developed a much
earlier actually in the late 1950 S. to assess the effectiveness of
the 1st generation of antidepressants it was a vision we
published in 1960 so again it was it was not developed to
measure depression and clients so much as to measure the
effectiveness of antidepressant medications. And. It's now. That
the most commonly used measure of depression so there are
some there often issues come by the minute. Related to the fact
that this that perhaps. This scale. Was developed by now
conceptualize ation of depression might have been something
different from one of his today all right so the last major review
was found 950 of this I'm not sure if you show me 2003 so this
is. Update 201517 items will question is on the sky right of Sky
question. Section and let's take a look at the reliability of the
Hamilton so we've got some. Information here. And looking
reasonably good this is kind of putting together a list of many
studies. Saying. If you know you know what information was
contained within. You in those studies and. So it's talking about
reliability and some statistics being used to cases are. So. In
terms of internal life of the sea. The estimates range from point
462.97.97 courses 5 fabulous to the excellent point 46 it's a
little bit. On the low and. But point 7 is his closer to accept to
be the best we're looking for really his point 8 while high.
Interests are a liability and this is this is a very important type
of reliability this means. That you know because this is what
happens in real life creations. Are using. In this case the
Hamilton depression scale are they. Getting the same result.
When they use this chaos and other clinicians so the pieces are
here ranged from point A to point 98 that was good very good.
But the individual item level so that was really the whole sky
off here the size of the individual lots of novel into a survival
using whole for many items and so that's. How some calls for
concern. So here. This is the board state psychiatric rating scale
should measure a single psychopathological construct, you know,
what we refer to in the medical model as an illness or a
syndrome (and whether they are actually medical illnesses is
another matter, mentioned in the other video), and be comprised
of items that adequately cover the range of symptoms that are
consistently associated with the syndrome. So again we're coming
back to the question, I should say, of whether or not this scale
captures depression as it is currently conceptualized per the
DSM, nowadays the DSM-5. We've already talked about whether or
not the DSM itself is valid and reliable, but that's what we
take as the standard. And so a depression scale is then measured
against it: the ability of the scale is measured in terms of how
well it's able to capture how we conceptualize depression per
the DSM diagnostic criteria. So here, in this paragraph: it's
reasonable to ask whether this instrument captures depression as
it is currently conceptualized. Several symptoms contained
within the Hamilton scale are not official DSM diagnostic
criteria, although they are recognized as features associated
with depression, for example psychic anxiety. For other symptoms
included in the Hamilton scale, for example loss of insight and
hypochondriasis, the link with depression is more tenuous. More
critically, important features of DSM-IV depression are often
buried within more complex items and sometimes are not captured
at all. So it could be a problem, since we are obliged in
practice to use the DSM. We need to use a measure, a scale,
that's going to give us a score that tells us whether or not the
client has depression as it is defined in the DSM. And there is
no explicit assessment of feelings of worthlessness. OK, so
that's something there in the DSM that [the scale] doesn't
capture. All right, so anyway. Let's
take a look at this, too. So this was the conclusion, or one of
the conclusions, of this particular study: that the Hamilton
depression scale is measuring a conception of depression that is
now several decades old and that is at best partly related to
the operationalization of depression in DSM-IV, so it may be
even further removed from how the DSM-5 conceptualizes
depression. [unclear] huge difference between DSM-IV and DSM-5,
but nevertheless it's a concern. So then we take a look now.
So we can see how these two scales are different from each other
even though they may be measuring the same thing. Well, first of
all, the number of items is different. Here, if we look at the
way that, in both scales, we add up the total score, you see
here that the Beck Depression Inventory says the cutoff of 0 to
13 is minimal depression; there's no score that tells you no
depression. I see that as interesting, whereas at least with the
Hamilton you can be not depressed. And then 8 to 12 on the
Hamilton: depression is doubtful, and then mild, moderate and
severe, which kind of seems to equate to the mild, moderate and
so on here on the Beck. It's the same words being used, but
whether they have the same meaning in terms of the actual
depressed state of the client is another matter. But also, what
you can see here is that some of the items effectively have a
different weighting in the Hamilton scale, whereas, as we
mentioned previously when we were talking about the Beck, each
item there is weighted equally. So since we're adding up the
total score, these items that range from 0 to 2 kind of have
less weighting than those that are rated from 0 to 4. You could
give [up to 4] for depressed mood, whereas you can give this
[unclear] item a maximum of 2. So anyway, the scales are
different, and so the question is: are they measuring the same
thing? And we've looked at [unclear] those validation studies,
and so that would be what we'd use when we're trying to make a
decision as to which scale we're going to use. Those would be
the deciding factors, besides practical things like ease of use
and so on. And of course, if we're planning to measure
depression in, say, children, then the wording of the questions
matters: one would have to consider whether the wording is going
to be understood as it's meant to be understood by a child
versus an adult. So, for your
assignment, let's say you've chosen depression as one of your
client's main problems. Then you can use something like Google
Scholar through [unclear] university and you can search for
"depression scale" and see what comes up, or "depression
inventory", and you'll probably get lots of hits. So the first
step is to find scales and see if they are potentially useful
for what you want to do. Then go for validation studies of those
particular scales. Once you've got some scales, you've got to
whittle this down to two scales per problem, so if you've chosen
depression you need to have two scales; you could even use the
Beck and the Hamilton and compare those two. But again, I ask
you not to use the validation studies that I've presented in
these videos, but to find some [unclear]. So for each scale you
have to find two validation studies; in this case you'd need two
for the Beck and two for the Hamilton, OK? And you look at the
reliability and validity statistics and use those, and things
like ease of use, to argue that, OK, I'm going to select the
Beck because blah blah blah, or the Hamilton because [unclear],
and give your reasons in terms of statistics and other factors
that make that a better choice for your client. OK? All right,
you also have some work on this in your discussion board
[unclear]. So.
International Journal of Mental Health Nursing (2007) 16, 108 –
115 doi: 10.1111/j.1447-0349.2007.00453.x
© 2007 Australian College of Mental Health Nurses Inc.
Feature Article
Measuring melancholy: A critique of the Beck
Depression Inventory and its use in mental
health nursing
Brad Hagen

School of Health Sciences, The University of Lethbridge,
Lethbridge, Alberta, Canada
ABSTRACT: The Beck Depression Inventory (BDI) is one of
the most commonly used depression
measurement instruments. Mental health nurses often utilize the
BDI to assess the level of depression
in clients, and to monitor the effectiveness of treatments such as
antidepressants and electroconvulsive
therapy. Despite the widespread use of the BDI in both clinical
practice and research, there is
surprisingly little nursing literature critically examining the
BDI or its use by mental health nurses.
This paper reviews the origins, purpose, and format of the BDI,
discusses some of the strengths and
limitations of the BDI, and concludes with some implications
for mental health nursing.
KEY WORDS: Beck Depression Inventory, depression, mental
health, psychiatric nursing,
psychometrics.
Correspondence: Brad Hagen, School of Health Sciences, The
University of Lethbridge, 4401 University Drive, Lethbridge,
AB
T1K 3M4, Canada. Email: [email protected]
Brad Hagen, RN, PhD.
Accepted May 2006.

Mental health nurses are often encouraged to use psychi-
atric/psychological measurement instruments in their
nursing practice, and the Beck Depression Inventory
(BDI; Beck et al. 1996) is one of the most commonly used
instruments that mental health nurses are likely to
encounter and use in their practice (Demyttenaere & De
Fruyt 2003). The majority of current psychiatric nursing
textbooks discuss the BDI, and in one commonly used
textbook, the BDI is described as ‘. . . a quick but reliable
and valid measure of the extent to which depression may
be present’ (Kneisl et al. 2004; p. 169). Indeed, in the first
author’s own local health region, where he supervises
nursing students in mental health clinical settings, nurses
commonly use the BDI to assess the level of depression
in patients, and to monitor the effectiveness of treatments
such as antidepressants and electroconvulsive therapy.
Yet despite the common use of the BDI by mental
health nurses, there is little or no nursing literature crit-
ically examining the BDI, or its use by mental health
nurses. Therefore, the purpose of this paper is to provide
a critical discussion of the BDI, and its use by mental
health nurses. To this end, the author will briefly review
the origins, purpose, and format of the BDI, discuss some
of the strengths and limitations of the BDI, and conclude
with some implications for mental health nursing.
ORIGINS, PURPOSE, AND FORMAT OF
THE BDI
As Demyttenaere and De Fruyt (2003) have described in
their review of depression rating scales, depression rating
scales were first developed in the late 1950s as part of the
overall psychopharmacology revolution, whereby psycho-
logical theories of depression gave way to commercially

driven biochemical theories of depression. They go on to
note that ‘. . . with the advent of antidepressant drugs,
rating scales were needed to measure the severity of the
depressive disorder and the changes during therapy’
(Demyttenaere & De Fruyt 2003; p. 61), and there are
now more than 100 depression rating scales in existence.
While there are many available depression rating scales,
the BDI, a self-administered scale first developed in 1961
(Beck et al. 1961), has risen to prominence as one of the
most widely used depression rating scales (Dozois &
Covin 2004).
Beck et al. (1961) developed the original BDI as a 21-
item inventory to measure the severity of depressive
symptoms, and based the 21 items on Beck’s observations
of the symptoms and attitudes of depressed persons seen
in the context of therapy. Beck has claimed that the BDI
does not reflect any particular theory of depression, and
merely reflects the observed symptoms of persons who
are depressed (Beck et al. 1996). While the first version
of the BDI was shown to be reasonably robust in terms
of psychometric properties, increasing concerns were
raised concerning the instrument’s validity vis-à-vis the

DSM-IV standard for diagnosing depression (American
Psychiatric Association 1994). Consequently, Beck et al.
created a second revised version of the BDI (BDI-II) in
1996 (Beck et al. 1996). The main changes made to
develop the BDI-II primarily reflected increased com-
patibility with the DSM-IV, and included the changing of
certain items, dropping of other items, and changes to
certain response options and time frames (Beck et al.
1996; Dozois et al. 1998). From this point forwards in
this manuscript, the authors will use the term ‘BDI’ to
refer to this most recent version of the instrument, the
BDI-II.
The BDI is typically self-administered, requires only
about 5–10 min to complete, and can be used with per-
sons aged 13 years and up (Dozois & Covin 2004). Each
one of the 21 items in the BDI is rated on a scale of 0–3,
and scores from all items are tallied to obtain a total
score, which can range from 0 to 63, with higher scores
reflecting greater severity of depressive symptomatology.
Scores between 0 and 13 are interpreted as ‘minimal’
depression, scores between 14 and 19 as ‘mild’ depres-
sion, scores of 20–28 as ‘moderate’ depression, and scores
of 29–63 as ‘severe’ depression (Beck et al. 1996; Dozois
& Covin 2004). Interestingly, it appears that with the
possible exception of a score of 0, there is no score
grouping to be interpreted as ‘no depression’.
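As a rough illustration of the scoring rule just described, the following Python sketch sums 21 item ratings and looks up the severity band. The function name `score_bdi_ii` and the example ratings are hypothetical and are not part of the BDI-II materials; the cut-offs are those quoted above from Beck et al. (1996).

```python
# Severity bands as quoted in the text from Beck et al. (1996): (low, high, label).
BDI_II_BANDS = [
    (0, 13, "minimal"),
    (14, 19, "mild"),
    (20, 28, "moderate"),
    (29, 63, "severe"),
]


def score_bdi_ii(item_ratings):
    """Sum 21 item ratings (each 0-3) and return (total, severity label).

    Hypothetical helper for illustration only; not an official scoring tool.
    """
    if len(item_ratings) != 21:
        raise ValueError("expected 21 item ratings")
    if any(r not in (0, 1, 2, 3) for r in item_ratings):
        raise ValueError("each rating must be an integer from 0 to 3")
    total = sum(item_ratings)
    for low, high, label in BDI_II_BANDS:
        if low <= total <= high:
            return total, label
    raise AssertionError("unreachable: total is always within 0-63")


# Invented respondent: rates the first 15 items as 1 and the rest as 0.
print(score_bdi_ii([1] * 15 + [0] * 6))  # (15, 'mild')
```

Note that even a total of 0 falls in the ‘minimal’ band, which is the point made above about the absence of a ‘no depression’ grouping.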
STRENGTHS OF THE BDI
Strengths of the BDI include the ease of administration
and scoring of the BDI, its widespread use, and the
results of psychometric testing of the reliability and valid-
ity of the BDI.
Ease of administering and scoring the BDI

One of the principal advantages of the BDI is its ease of
administration and scoring (Dozois & Covin 2004).
Indeed, the BDI generally takes less than 10 min to
complete, and is easily scored and interpreted. Conse-
quently, the BDI has become one of the most widely used
psychological tests, has been translated into many lan-
guages, and has been employed in more than 2000 empir-
ical studies (Barroso & Sandelowski 2001; Dozois &
Covin 2004; Richter et al. 1998).
Psychometric testing of the BDI
Reliability of the BDI
Although more reliability testing has been completed on
the original BDI than the BDI-II, both are considered to
be generally quite reliable (Dozois et al. 1998; Richter
et al. 1998). The original manual for the BDI-II reported
high internal consistency, with a coefficient alpha of 0.93
for college students, and 0.92 for psychiatric outpatients
(Beck et al. 1996). More recently, Dozois and Covin
(2004) reviewed 13 studies reporting reliability data on
the BDI-II since 1996, and reported an average coeffi-
cient alpha of 0.91. Less information is available on the
test–retest reliability of the BDI-II, although the original
manual reports a 1-week test–retest reliability coefficient
of 0.93 with 26 psychiatric outpatients (Beck et al. 1996).
As Dozois and Covin (2004) have cautioned, however,
test–retest reliability is difficult to interpret on a measure
that is supposed to both reliably measure depression and
detect changes in depression due to treatment. For
example, at least one group of researchers have sug-
gested that the BDI may not be reliable for longer peri-
ods of time in non-clinical samples, after finding that
BDI scores declined by 40% over 2 months in a non-
clinical sample (Ahava et al. 1998). Such a significant
downward drift in BDI scores in non-clinical samples

clearly poses a threat to the instrument’s ability to reli-
ably detect changes in depression due to treatment
alone.
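For readers unfamiliar with coefficient alpha, the internal consistency statistic reported above, a minimal sketch of the standard calculation follows. It uses the usual formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the tiny three-item data set are invented for illustration and are not BDI data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Transpose rows (respondents) into columns (items).
    items = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Invented data: four respondents answering three items rated 0-3.
example = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 3, 2],
    [3, 2, 3],
]
print(round(cronbach_alpha(example), 3))  # 0.875
```

Values close to 1, such as the 0.91 to 0.93 figures cited above, indicate that the items rise and fall together across respondents; they do not by themselves show that the items measure depression.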
Validity of the BDI
Dozois and Covin (2004) have asserted that while the
BDI-II is comparable to the original BDI in terms of reliability,
the BDI-II is ‘. . . a clearly superior instrument in
terms of its validity’ (p. 53). Such claims for the higher
validity of the BDI-II are made on a number of levels. To
begin with, the content validity and the face validity of
the BDI-II are argued to be very high, because the items
in the BDI-II now closely mirror the standard DSM-IV
diagnostic criteria for depression (Dozois & Covin 2004;
Richter et al. 1998). The convergent validity of the BDI-
II has also been reported, and the BDI-II appears to
correlate fairly well with other depression rating scales,
such as the original BDI-I (r = 0.93), the Hamilton Rating
Scale for depression (r = 0.71), and the Beck Hopeless-
ness Scale (r = 0.68) (Dozois & Covin 2004).
The level of discriminant validity for the BDI is less
clear. For example, while Richter et al. (1998) have con-
cluded that the BDI ‘. . . discriminates reliably between
depressives and non-depressives’ (p. 162), Dozois and
Covin (2004) came to the opposite conclusion, stating that
the BDI-II does not differentiate well between depressed
and non-depressed persons. Furthermore, it has been
noted that the BDI-II correlates highly with other mea-
sures of anxiety, and may not be able to reliably distin-
guish between depression and other affective states such
as anxiety (Dozois & Covin 2004).
Finally, it has been suggested that the BDI-II contains
a reasonably stable factor structure (Beck et al. 1996;

Dozois & Covin 2004). When the BDI-II was first
released, Beck et al. (1996) reported a two-factor solu-
tion: somatic-affective and cognitive symptoms within a
psychiatric outpatient sample, and cognitive-affective and
somatic symptoms with a college student sample. Other
researchers have since found generally similar two-factor
structures in other studies using college students (Dozois
et al. 1998) and primary care medical patients (Arnau
et al. 2001). It should be noted, however, that in their
review of the BDI, Richter et al. (1998) concluded that
the tacit factorial validity of the BDI is in fact controver-
sial, and that subtle but possibly important differences
exist in the factor structure of the BDI, depending upon
the kinds of subjects that complete the BDI.
In summary, the main support for the BDI appears to
lie in its ease of use, widespread utilization, very good
internal reliability, high content validity when compared
with the DSM-IV criteria for depression, good conver-
gent validity with other similar depression rating scales,
and a somewhat stable factor structure.
LIMITATIONS OF THE BDI
While the BDI is well-known and widely used by mental
health nurses, and while the BDI has several strengths,
there is little critical discussion in the nursing literature
of some of the potential limitations of the use of the BDI
in general, or by mental health nurses in particular. Some
of the potential limitations of the BDI include: issues
related to norms (including potential bias issues); prob-
lems with the wording, ordering, and weighting of the
BDI items; potential gender biases; theoretical issues
with the BDI; potentially inappropriate uses of the BDI;
and validity issues related to the DSM-IV criteria for
depression, upon which the BDI is based.

Norms and bias issues
The BDI has no actual large population norms per se, so
it is difficult to determine if any given individual’s level of
depression, as determined by the BDI, is ‘normal’ in any
sense of the word. Instead, the interpretation of the BDI
is referenced to criteria based on the original standardized
sample of 500 persons (317 women and 183 men) in
the Eastern United States (Beck et al. 1996). Based on
this sample, the authors of the manual for the BDI-II
offered cut-off score criteria or guidelines to distinguish
between minimal, mild, moderate, and severe amounts of
depression. However, while the total possible scores
range from 0 to 63, the scoring of the scale is very ‘bottom
heavy’. That is, the mean score for severely depressed
persons in the standardized sample (32.96) is approxi-
mately half-way along the range of total possible scores,
and anyone who scores anywhere from 29 to 63 is con-
sidered to be ‘severely’ depressed (Beck et al. 1996).
As several authors have noted, the original sample
upon which the BDI-II was standardized was predomi-
nantly Caucasian, and is greatly misrepresentative of the
US population at large (Dozois & Covin 2004; Richter
et al. 1998). Obviously, this kind of sample also renders
the BDI generally misrepresentative of other countries
and cultures (Dozois & Covin 2004), and fails to capture
the many different cultural factors influencing how
depression is experienced by different ethnic and cultural
groups (Falicov 2003). Finally, women tend to score
higher on the BDI than men (Beck et al. 1996) and items
on the BDI such as ‘crying’ may contain a gender bias,
and may hold very different meanings for men as opposed
to women (Barroso & Sandelowski 2001).
Item-related issues

There are also several problems with the way that items
contained in the BDI are worded, ordered, and weighted.
To begin with, several authors (Barroso & Sandelowski
2001; Demyttenaere & De Fruyt 2003; Richter et al.
1998) have noted that the BDI item response options,
most of which contain some combination of negatively
and positively worded options, can be very confusing and
misleading for persons taking the BDI. In addition, the
responses are only ordinal-level data, with unequal inter-
vals between options, yet are tallied up, analysed, and
reported as if they are ratio-level data (Burns & Grove
2001). There is also a tendency for responses on each item
to score quite low. That is, although potential scores for
each item range from 0 to 3, studies in non-clinical (stu-
dent) samples typically report average scores below 1, and
even in psychiatric samples mean item scores rarely exceed
values of 2 (Richter et al. 1998).
The BDI also has an obvious consistency in the order-
ing of responses, as each response is ordered from least
to most depressed. Therefore, several authors have noted
that the obvious ordering of answers in the BDI may lead
to responses reflective of faking, social desirability, and/
or defensiveness, as opposed to depression per se (Bar-
roso & Sandelowski 2001; Dozois & Covin 2004). This
problem is compounded by the BDI’s high face validity,
which makes it easy for subjects to guess which of the
items are reflective of greater or lesser depression.
Finally, there is the problem of item weighting within
the BDI. As Healy (1997) has noted, the creators of the
BDI arbitrarily decided that the total BDI depression
score would be generated by simply adding the individual
scores for each of the 21 items. This kind of scoring
system raises the issue of whether or not it is legitimate
to simply add dissimilar items to produce a total score,
and subsequently assume that the total score actually
‘means’ something (this process is called reification, and
shall be addressed further in this paper). This scoring
process also raises the issue of whether or not each added
item in the BDI should have equal weighting in relation
to the other items. Healy (1997) has commented on this

problem, and has noted that ‘. . . many clinicians had dif-
ficulties with the idea that items of very different meaning
could simply be summed. Should early morning awaken-
ing be counted in the same balance as guilt or suicidality?’
(p. 98). Yet despite such criticisms, all items of the BDI
continue to be treated as if they are of equal importance
in determining a person’s level of depression, and the
creators of the BDI have offered no justification for such
a stance.
Theoretical issues with the BDI
There are a number of important theoretical limitations
with the BDI as well. First and foremost is the problem
with the supposedly atheoretical nature of the BDI. That
is, although the creators of the BDI maintain that the
BDI merely reflects the symptoms and attitudes typically
found in persons with depression – and does not reflect
any theoretical assumptions about depression (Beck et al.
1996; Dozois & Covin 2004) – other authors have chal-
lenged this claim of theoretical neutrality. Demyttenaere
and De Fruyt (2003), for example, have noted that the
BDI clearly reflects a distinctly cognitive–behavioural
perspective. This perspective is not surprising, given that
Beck et al. were primarily responsible for the creation of
cognitive therapy. Healy (1997) has also observed that it
is probably more than coincidence that the BDI is partic-
ularly well-suited for evaluating cognitive–behavioural
therapy, and that it would be very difficult for a person
who has gone through cognitive therapy not to recognize
many of the terms and language used in the BDI. Therefore,
the assertion that the BDI is theoretically neutral or
bias-free is simply not true, nor should this necessarily
be surprising. As Jensen and Hoagwood (1997) have
emphasized:

. . . it should be noted that all clinicians – indeed, all
human beings – bring theory-laden perspectives and con-
ceptual filters to their assessment and diagnostic
approaches with a given patient. They differ principally
in the explicitness, rigidity and awareness of their biases
(p. 235).
Perhaps one of the most important sources of bias
found within the BDI is reflected in what the architects
of the BDI chose not to include as items in the tool. For
example, the BDI focuses exclusively on negative symp-
tomatology – such as sadness, guilt, and feeling like a
failure – and no positive experiences or symptoms are
included, despite research suggesting that positive mood
may well be superior to negative mood in predicting
outcomes from depression (Demyttenaere & De Fruyt
2003). In addition, the creators of the BDI chose to dis-
regard large areas of interpersonal functioning, and many
of the factors which determine quality of life for individ-
uals (Healy 1997). Finally, Beck et al. have selected items
for the BDI that clearly reflect a theoretical stance
whereby the problem (i.e. depression) is seen to lie within
the individual. By focusing exclusively on symptoms or
problems inside the person, the BDI explicitly disregards
all the multitude of factors and problems external to the
individual that may be clearly impacting his or her level
of depression, such as unemployment, discrimination
and/or domestic violence (Crowe 2000; Jensen & Hoag-
wood 1997).
Lastly, the BDI exhibits the theoretical problem of
reification, or the tendency to view abstract concepts as
actual entities. That is, the creators of the BDI would
have us believe that the simple process of adding up the
answers to 21 questions about various symptoms and atti-
tudes allows us to measure, with a single number, the

quantity of a reliably identifiable ‘thing’ called depression,
as if we were measuring the weight or height of an indi-
vidual. Yet as Gould (1996) has pointed out, measuring
and reifying such concepts as ‘depression’ and ‘intelli-
gence’, as if they have a definite existence of their own,
can be very misleading. Not only can such reification
oversimplify complex and multifaceted experiences like
depression, but such reification also disregards the large
extent to which such concepts are socially created and
defined, and fail to actually reflect any clearly tangible and
unambiguous entities, such as height, weight, or hyper-
tension (Sarbin 1997).
Inappropriate uses of the BDI
Additional concern has also been raised regarding the use
of potentially controversial ‘spin-offs’ of the BDI, as well
as the manner in which the BDI is being increasingly used
to ‘screen’ for depression. With regards to spin-offs,
Dozois and Covin (2004) have noted the increasing use
of controversial spin-offs of the BDI, such as the ‘short
form’ of the BDI (Furlanetto, Mendlowicz & Romildo
2005). The increasing use of such spin-offs, because of
their comparative lack of psychometric testing and estab-
lished norms, only further compounds the overall issues
of reliability and validity of the BDI and other depression
rating scales.

Furthermore, increasing numbers of clinicians, includ-
ing nurses, are beginning to use the BDI as a depression
‘screening tool’, particularly now that the BDI-II closely
mirrors the DSM diagnostic criteria for depression (Lasa
et al. 2000). However, despite the fact that the creators
of the BDI specifically stated that the BDI was not to
be used as a diagnostic tool, and was to only be used as a
measurement of depressive symptom severity (Beck et al.
1996; Dozois & Covin 2004), the demarcation lines
between measuring symptoms of depression, diagnosing
depression, and screening for depression have never been
clear, and are becoming even less clear. In particular, it
was never conceptually clear why the BDI – an instru-
ment apparently able to measure the quantity of depres-
sion – could not actually determine the presence of
depression or not (i.e. diagnose it), particularly when so
many of the BDI items used to measure depression symp-
toms were so similar to the same DSM-IV diagnostic
criteria for depression. Despite this logical inconsistency,
the BDI is increasingly being used not only to measure
depression, but to detect or diagnose it as well (Lasa et al.
2000), despite a lack of clear validation for doing so (Beck
et al. 1996).
Validity issues and the DSM
When Beck et al. attempted to increase the validity of the
BDI by making the BDI-II more closely mirror DSM-IV
diagnostic criteria for depression (Beck et al. 1996), they
also further reinforced the common assumption that the
DSM-IV offers the most valid definition and description
of the experience of depression. By doing so, however,
they not only overlooked considerable criticism of the way
that the DSM-IV authors categorize depression, but also
inherited many of the limitations of the DSM-IV descrip-
tion of depression (Beutler & Malik 2002; Crowe 2000;

Donald 2001; Eriksen & Kress 2005; Jensen & Hoagwood
1997; Sarbin 1997). Therefore, any examination of the
limitations of depression scales like the BDI must also
include an examination of the limitations of the DSM-IV
criteria. These limitations include issues of reliability and
validity of the DSM-IV, and issues of DSM-IV value
judgements and biases.
Issues of the reliability and validity of the DSM-IV
While many mental health clinicians simply take the reli-
ability and validity of the DSM-IV system for granted,
closer examination often finds the reliability and validity
of the DSM-IV wanting. In fact, the reliability of the
diagnosis of major depression is quite poor, and research-
ers have reported kappa coefficients for the diagnosis of
depression as low as 0.25 (Parker 2005). As Beutler and
Malik (2002) have observed, this inadequate level of diag-
nostic reliability is not surprising, given the ambiguous
and complex set of guidelines that the authors of the
DSM-IV created to diagnose depression. In fact, the
DSM-IV criteria for depression literally allow for several
hundred possible different patterns or clusters of symptoms,
all of which can still meet the DSM-IV diagnostic
criteria for depression (American Psychiatric
Association 1994).
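The ‘several hundred’ figure can be checked with a short enumeration. The sketch below assumes the DSM-IV rule for a major depressive episode, which is not spelled out in this paper: at least five of nine listed symptoms must be present, including at least one of depressed mood or loss of interest or pleasure. It counts only which symptoms are present, ignoring severity and direction (for example, insomnia versus hypersomnia), so under those assumptions it is a lower bound. The function and parameter names are hypothetical.

```python
from itertools import combinations


def count_qualifying_patterns(n_symptoms=9, core=(0, 1), min_present=5):
    """Count symptom subsets meeting the assumed DSM-IV threshold.

    Assumptions (not taken from the paper): symptoms are indexed
    0..n_symptoms-1, the two 'core' indices stand for depressed mood and
    loss of interest or pleasure, and at least one core symptom is required.
    """
    core_set = set(core)
    count = 0
    for size in range(min_present, n_symptoms + 1):
        for combo in combinations(range(n_symptoms), size):
            if core_set & set(combo):
                count += 1
    return count


print(count_qualifying_patterns())  # 227
```

Under these assumptions two people can share the same diagnosis while having only a minority of symptoms in common, which is consistent with the reliability concerns raised above.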
Given the myriad of symptom patterns which can qual-
ify for a DSM-IV diagnosis of depression, the DSM-IV
diagnosis of depression suffers not only from reliability
problems, but from considerable validity problems as
well. For example, all forms of depression share great
overlap with numerous other psychiatric diagnoses contained
within the DSM-IV, and depression is found to be
comorbid in 60% of general psychiatric patients, and in
40% of patients diagnosed with anxiety disorders (Beutler

& Malik 2002). In addition, another key indicator of diag-
nostic validity – that different diagnoses would respond
differently and predictably to prescribed treatments – is
also lacking. That is, both the natural and treatment his-
tories of persons with a diagnosis of depression are noto-
riously hard to predict (Parker 2005). Furthermore, a
wide variety of treatments with entirely different theoret-
ical explanations – including antidepressants, different
forms of counselling, electroconvulsive therapy, St. John’s
Wort, exercise, and placebo – all have nearly identical
efficacy levels, seriously challenging the diagnostic con-
struct of depression implicitly contained within the BDI
(Parker 2005).
Value judgements and biases within the DSM
As opposed to other medical diagnoses, the diagnoses
contained within the DSM-IV (including the diagnosis of
depression) are not arrived at with the aid of laboratory
tests or diagnostic imaging, but rely instead on a clinician’s
judgement as to whether or not certain behaviours and
symptoms, such as feelings of guilt or loss of concentra-
tion, are present in certain prescribed patterns for certain
prescribed periods of time (Eriksen & Kress 2005). These

prescriptions for what constitutes a diagnosis (or disorder)
are in turn arrived at by consensus by committees and
panels of psychiatric experts associated with the American
Psychiatric Association (1994).
Numerous authors have challenged this DSM
diagnosis-by-consensus process, claiming that the pro-
cess reflects not so much a scientific and objective pro-
cess, but a process whereby the values and biases of the
privileged few comprising ‘expert consensus’ panels
become embedded in our society’s definitions of mental
disorders (Beutler & Malik 2002; Eriksen & Kress 2005;
Jensen & Hoagwood 1997; Kutchins & Kirk 1997; Sarbin
1997). Female scholars in particular (Caplan 1995; Crowe
2000; Russell 1986) have noted the preponderance of
upper-middle and upper class men in DSM diagnostic
expert committees, and have suggested that Western,
male, and upper/middle class values strongly influence
decisions regarding diagnoses such as depression, and
how such diagnoses are applied. For example, using the
standard DSM-IV criteria for depression, twice as many
women are diagnosed with depression as men (Kuehner
2003), yet many contextual factors – such as
gender discrimination in society or the higher rates of
sexual abuse and assault in girls and women – are rarely
taken into account when diagnosing or measuring depression
(Whitfield 2003).
This disregard of contextual factors reflects another
bias inherent within the DSM-IV diagnosis of depression,
the notion that mental disorders are located within indi-
viduals. This tendency to locate mental disorders and
problems inside individuals has important implications, as
it can easily direct clinicians’ attention away from the
social context of mental health issues. That is, numerous
authors have argued forcibly that it is equally plausible –
and perhaps more appropriate – to suggest that it may
well be our families, communities, and societies that
deserve such labels as ‘depressed’, ‘disordered’ or ‘men-
tally ill’, as opposed to individual persons (Crowe 2000;
Jensen & Hoagwood 1997; Russell 1986; Sarbin 1997;
Whitfield 2003). For example, a woman suffering from
domestic violence and seeking assistance from the mental
health system is likely to receive a psychiatric diagnosis
of depression and/or post-traumatic stress disorder, and
may be given instruments like the BDI to determine the
extent of her ‘disorder’. Yet the real source of the woman’s
problems – the perpetrator of the violence towards her –
is typically given no corresponding psychiatric diagnosis
(Eriksen & Kress 2005).
SUMMARY AND CONCLUSIONS
In summary, it has been shown that the BDI was created
in the historical context of the rise of psychopharmacol-
ogy and DSM nosological classification systems in the
mental health care system, …

Hello class. This brief video is meant to help you get started
on your paper, part one, the literature review. One of the tasks
that you have to carry out for the paper is to identify scales,
measures, inventories, questionnaires (whatever you want to call
them, they are the same thing) that you can use to measure the
two main problems that you identified for your client in your
case summary. OK, so we're going to use the Beck Depression
Inventory as an example, and I'll just go through this process
of, first of all, finding a suitable scale and then looking for
validation studies: research articles that provide data
supporting the reliability and the validity of the scale.

OK, so this is the Beck Depression Inventory. It's been very
widely used for many years. And just a quick point here: see
this? It indicates that there is a pharmaceutical company
sponsoring this. So this is the BDI-II, the second version.
We're looking at the 21-question version; there is a shorter
13-question version as well as the 21-question version. There
are 21 questions, and this is an explanation of what the scale
consists of.
Each answer is scored on a scale of 0 to 3. And this is
explaining the cutoffs used to determine the level of depression
the client has after completing this questionnaire. The cutoffs
have changed from the original, so the ones they use now are:
0 to 13 is minimal depression, 14 to 19 is mild depression,
20 to 28 is moderate depression, and 29 to 63 is severe
depression. The idea is that higher total scores indicate more
severe depressive symptoms on this scale.
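
The scoring rule just described (21 items, each scored 0 to 3, summed and compared against the cutoffs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `bdi_total` and `severity_band` are hypothetical, the bands are the ones read out in the video, and nothing here reproduces the actual wording of the inventory.

```python
from typing import Sequence

# Severity bands for the 21-item version, as read out in the video.
BDI_II_BANDS = [
    (0, 13, "minimal"),
    (14, 19, "mild"),
    (20, 28, "moderate"),
    (29, 63, "severe"),
]


def bdi_total(responses: Sequence[int]) -> int:
    """Add up 21 item scores, each of which must be 0, 1, 2 or 3."""
    if len(responses) != 21:
        raise ValueError("expected exactly 21 item scores")
    if any(score not in (0, 1, 2, 3) for score in responses):
        raise ValueError("each item score must be 0, 1, 2 or 3")
    return sum(responses)


def severity_band(total: int) -> str:
    """Map a total score (0 to 63) to its severity label."""
    for low, high, label in BDI_II_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("total score must be between 0 and 63")


# Hypothetical client who answers 1 on every item.
total = bdi_total([1] * 21)
print(total, severity_band(total))
```

Note that every item contributes equally to the total, which is exactly the equal-weighting assumption that the critique discussed later in the video takes issue with.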
Another important thing here is that you see there are some
instructions that tell the person who's completing the scale,
the inventory, what they've got to do. As an aside, when you
create your self-made scale, which is another part of the
assignment, make sure that you give a simple set of
instructions on how to complete your self-made scale. OK, back
to this. So anyway, back to the Beck Depression Inventory.
There is a link between this and the DSM diagnosis of depressive
disorder, which later became major depressive disorder, and Beck
has attempted to match the questions with the diagnostic
criteria found in the DSM. This raises an important question,
because the DSM is not known for being scientifically valid or
reliable. So if it turns out not to be providing us with a
scientific description of the real thing we call depression,
then that could impact the actual usefulness or meaningfulness
of this kind of inventory. But that's a whole different
discussion. The fact is that this is a well-used, well-known
scale, and it serves the purpose of being an example of how to
find a suitable scale and then look for a good research article
which would constitute a validation study for this scale.

So you can see these different categories that, according to
Beck, constitute a kind of definition of depression. These
different things are aspects of depression according to Beck.
So then, once the person has completed the inventory, they just
add up the scores and, according to the cutoffs given, you
determine the level of depression. OK.

So then let's go to validation studies that look at this. This
is from the International Journal of Mental Health Nursing. If
you are working in a crisis stabilization unit or a receiving
facility, you will be working alongside mental health nurses, so
we have certain overlaps in the work that we do. So this is a
critique of the BDI and its use in mental health nursing. Let's
just take a look at the important things that make this article
a good validation study.

So we see here an introduction with a bit of background about
the BDI. Just like you saw, it has a corporate sponsor, a
pharmaceutical corporate sponsor, and it was designed, in a way,
to prove the efficacy of antidepressant drugs. If you use the
BDI before the person starts treatment, you get a certain
measure of depression, and then you can use it throughout
treatment and at the end of treatment too, so you can,
theoretically, measure the client's improvement using this
scale. And here the article is talking about the relationship
between Beck's scale and the DSM, and how the scale has been
revised to become as compatible as possible with the DSM
definition of this construct we call depression.

So then the article goes on to talk about the strengths of the
BDI: it is easy to use and score. And then we go on to this very
important section here. It starts talking about reliability, so
here he's going to give us some Cronbach's alphas, which is what
we're looking for in a validation study. Alphas range between
0 and 1, and these scores are really very good.
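
For anyone wondering where a Cronbach's alpha comes from, the sketch below computes it from a table of item scores using the standard formula: alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the small data set are made up for illustration; they are not data from the article or from the BDI manual.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items table of item scores."""
    if len(scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every respondent needs the same number (>= 2) of items")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 respondents answering 4 items scored 0 to 3.
example = [
    [0, 1, 0, 1],
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]
print(round(cronbach_alpha(example), 2))
```

The more the items rise and fall together across respondents, the larger the variance of the totals is relative to the item variances, and the closer alpha gets to 1.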
The alphas are 0.93 for college students and 0.92 for
psychiatric outpatients, and a meta-analysis of a number of
studies reported an average of 0.91. So these figures are great
as far as reliability is concerned; they meet the criteria for a
validation study. However, there is a caution here: one group of
researchers has suggested the BDI may not be reliable for longer
periods of time in non-clinical samples, after finding that BDI
scores declined by 40 percent over two months in a non-clinical
sample. So this is a small worry as far as test-retest
reliability is concerned, but there is a fair amount of evidence
of its reliability too.

Now let's look at validity: is the BDI measuring depression? Is
it measuring what we think it's measuring, what it's supposed to
be measuring, what it says it's measuring? Here we see that,
according to these researchers, the content validity of the
BDI-II is argued to be very high because the items of the BDI-II
now closely mirror the standard DSM-IV diagnostic criteria for
depression. So notice that the DSM-IV is being taken as
scientific fact here, and because the scale closely mirrors the
diagnostic criteria for depression in the DSM-IV, its content
validity is being stated to be very high.
The critical problem is that the DSM has not passed such
rigorous tests. And there are different kinds of validity; you
hopefully will have learned all about these in the videos about
reliability and validity. See, now, the level of discriminant
validity for the BDI is less clear. What that means is: is it
able to discriminate between people who are actually depressed
and people who are not depressed? And it doesn't seem to do so
well there. However, another way of determining validity is to
compare the results of the BDI with the results of other
depression scales, other measures of depression, and if they
correlate highly, then we can say that this validates the BDI,
because it gives the same kind of indications regarding
depression as other well-known, well-established measures of
depression. OK.

So, summarizing then, the strengths of the BDI-II lie in its
ease of use, widespread utilization, very good internal
reliability, high content validity when compared to the DSM-IV
criteria for depression, good convergence with other similar
depression rating scales, and a somewhat stable factor
structure. OK, never mind what factor structure means if you
haven't covered that area of statistics.

All right. Then it's important to note, when you're looking for
validation studies, what population the instrument being used
was normed on. In this case, the article says the BDI has no
actual large population norms per se, so it is difficult to
determine if any given individual's level of depression as
determined by the BDI is normal in any sense of the word. So
that's kind of a problem. Also, the original sample on which the
BDI-II was standardized was predominantly Caucasian and is
greatly unrepresentative of the US population at large. What you
sometimes find is that the sample used for standardizing the
scale, the questionnaire, whatever, the population that it was
normed on, is not necessarily representative of the wider
population. Sometimes a scale has only been normed on male
participants, nobody else. So these kinds of things can
compromise the validity. Here it says that women tend to score
higher on the BDI than men, so this would be one of those, and
it is suggested here that that may be because some of the items
on the scale show gender bias.

And there are also several problems with the way that the BDI's
content is worded and weighted. If you go back to the scale
itself, each of the items has equal weight, because you're just
scoring each item and then adding up the total. But the question
is: for example, suicidal thoughts or wishes, is that really
equivalent to something like changes in appetite, or feeling
tired, or loss of interest in sex? Are those really equivalent
to the extent that you could weight them equally? The answer is
obviously no. So there is that criticism of the BDI. Then again,
another problem with any kind of questionnaire is the order in
which the items are placed, because sometimes this can create
test-taking problems. Here you see that several authors have
noted that the obvious ordering of items in the BDI may lead to
responses reflective of faking or social desirability. That is,
the person gives the answers that they think the person
administering the test wants them to give, or alternatively,
they don't want to be viewed in a negative way, so they give
answers that they think will make them look better, et cetera.
All right, so that's always something you have to consider with
any questionnaire. And on just adding up the scores, there's the
bit about whether early morning awakening should be counted in
the same balance as guilt or suicidality, as if they were of
equal importance in determining a person's level of depression.
The creators of the BDI have offered no justification for such a
stance.

The BDI is supposedly atheoretical. However, it is linked to a
particular cognitive-behavioral perspective, as espoused by Beck
and his colleagues. So in that sense it's not atheoretical; it
sort of reflects Beck's views on this abstract construct that we
call depression. So there is that: Beck is no different from
anybody else in having his own theory-laden, subjective
conceptual filters to his thought process.

OK, so there are other criticisms you can read in the whole
article. Beck and his colleagues seem to have selected items
that reflect a theoretical stance whereby the problem, in this
case depression, is seen to live within the individual, and
neglect the multitude of external factors that could be
impacting the person, such as abuse and discrimination.
Basically you're searching for the problem in a certain place,
because this is very much a medical model: you're depressed, so
you have a chemical imbalance. The chemical imbalance idea is a
myth. You will learn more about this, but suffice it to say
there is no scientific basis for most of this chemical imbalance
theorizing.
You know, the stuff that happens in a person's life is very
likely to lead to what we call depression. It's not necessarily
due to genes or biochemistry, I think. But that's another whole
debate; let's get back to the task at hand here. All right. So
this is an important point here: that the BDI exhibits the
conceptual problem of reification, or the tendency to view
abstract concepts as actual entities. That is, the creators of
the BDI would have us believe that the simple process of adding
up the answers to 21 questions about various symptoms and
attitudes allows us to measure with a single number the quantity
of a reliably identifiable thing called depression, as if we
were measuring the weight or height of an individual. OK, so
this again comes down to the whole question of what these
so-called mental health conditions are.
You know, we've developed a whole system of identifying,
labeling, and treating these so-called mental illnesses as if
they were medical conditions, using the medical model. And of
course it suits the health insurance companies, it suits Big
Pharma, it suits the rest of the system. But is it really based
on good science? That's the question. On the issue of the
reliability and validity of the DSM-IV, this is a good article
because it actually raises this point: kappa coefficients for
the diagnosis of depression of about 0.25. So this is a massive
debate, but it's important to recognize that this so-called
mental health or psychiatric bible, the DSM, DSM-5 now, we take
as factual, but it's really a compilation of expert opinion. It
has not, to this point in time, been provided with a scientific
basis.
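
As background to the kappa figure mentioned above: kappa measures how far two clinicians' diagnoses agree beyond what chance alone would produce. The sketch below computes Cohen's kappa for two raters as (observed agreement - chance agreement) / (1 - chance agreement). The function name `cohens_kappa` and the ratings are invented for illustration; this is not the data or necessarily the exact statistic behind the figure cited in the article.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)

    # Proportion of cases on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (observed - expected) / (1 - expected)


# Hypothetical diagnoses by two clinicians for the same 8 clients.
clinician_1 = ["dep", "dep", "dep", "none", "none", "dep", "none", "none"]
clinician_2 = ["dep", "none", "dep", "none", "dep", "none", "none", "dep"]
print(round(cohens_kappa(clinician_1, clinician_2), 2))
```

A kappa of 1 means the raters always agree, and a kappa of 0 means they agree no more often than chance would predict, which is why low kappa values for a diagnosis are read as a reliability problem.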
And so reliability, too, is important to remember. OK, so that
constitutes a very good, excellent in fact, article. Don't use
this one, though. If one of your client's problems is depression
and you want to use the Beck Depression Inventory, that's fine,
but please find your own validation studies. Don't just use this
one, because part of this exercise is for you to do the research
yourself. You'll find plenty of other research articles on the
BDI. OK, that's it for now.
