Validity of the tool
Reliability of the tool
Mikki Khan
Ph. D. Scholar
 For the statistical consultant working with social science researchers, the estimation of reliability and validity is a frequently encountered task.
 Measurement issues differ in the social
sciences in that they are related to the
quantification of abstract, intangible and
unobservable constructs. In many instances,
then, the meaning of quantities is only
inferred.
Validity refers to the accuracy of inferences
drawn from an assessment.
It is the degree to which the assessment
measures what it is intended to measure.
 An instrument cannot validly measure an
attribute if it is inconsistent and inaccurate.
 An unreliable instrument contains too much
error to be a valid indicator of the target
variable.
 An instrument can, however, be reliable
without being valid. Suppose we had the idea
to assess patients’ anxiety by measuring the
circumference of their wrists.
 We could obtain highly accurate, consistent, and precise measurements of wrist circumferences, but such measures would not be valid indicators of anxiety.
 Thus, the high reliability of an instrument provides no evidence of its validity; low reliability of a measure is evidence of low validity.
 Like reliability, validity has different aspects
and assessment approaches. Unlike
reliability, however, an instrument’s validity
is difficult to establish.
 There are no equations that can easily be
applied to the scores of a hopelessness scale
to estimate how good a job the scale is doing
in measuring the critical variable.
 1. Face validity
 2. Content validity
 3. Pragmatic (criterion) validity
 A. Concurrent validity
 B. Predictive validity
 4. Construct validity
 A. Known-groups technique
 B. Convergent validity
 C. Discriminant validity
 Face validity refers to whether the
instrument looks as though it is measuring
the appropriate construct.
 Although face validity should not be
considered primary evidence for an
instrument’s validity, it is helpful for a
measure to have face validity if other types
of validity have also been demonstrated.
Subjective judgment of experts about:
 “what’s there”
 Do the measures make sense?
 Compare each item to the conceptual definition.
 Does it represent the concept in question?
 If not, it should be dropped.
 Is the measure valid “on its face”?
 Content validity concerns the degree to
which an instrument has an appropriate
sample of items for the construct being
measured. Content validity is relevant for
both affective measures (i.e., measures
relating to feelings, emotions, and
psychological traits) and cognitive measures.
 For cognitive measures, the content validity question is: How representative are the questions on this test of the universe of questions on this topic?
 For example, suppose we were testing
students’ knowledge about major nursing
theories. The test would not be content valid
if it omitted questions about, for example,
Orem’s self-care theory.
 Establishing criterion-related validity
involves determining the relationship
between an instrument and an external
criterion. The instrument is said to be valid if
its scores correlate highly with scores on the
criterion. For example, if a measure of
attitudes toward premarital sex correlates
highly with subsequent loss of virginity in a
sample of teenagers, then the attitude scale
would have good validity.
 Criterion-related validity is most appropriate
when there is a concrete, well-accepted
criterion. Once a criterion is selected,
criterion validity can be assessed easily.
 A correlation coefficient is computed
between scores on the instrument and the
criterion. The magnitude of the coefficient is
a direct estimate of how valid the instrument
is, according to this validation method.
 To illustrate, suppose researchers developed
a scale to measure nurses’ professionalism.
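To make the arithmetic concrete, here is a minimal sketch, in Python, of estimating a criterion-related validity coefficient as the Pearson correlation between instrument scores and criterion scores. The professionalism scores, the choice of supervisor ratings as the criterion, and all values are invented for illustration; they are not from the source.

```python
# Hypothetical illustration of criterion-related validity:
# correlate instrument scores with an external criterion.
from scipy.stats import pearsonr

professionalism_scores = [42, 55, 38, 61, 47, 50, 44, 58, 36, 53]  # instrument scores (invented)
supervisor_ratings     = [3, 5, 2, 5, 4, 4, 3, 5, 2, 4]            # hypothetical external criterion

r, p_value = pearsonr(professionalism_scores, supervisor_ratings)
print(f"Criterion-related validity coefficient: r = {r:.2f} (p = {p_value:.3f})")
```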
 Predictive validity refers to the adequacy of
an instrument in differentiating between
people’s performance on some future
criterion. When a school of nursing correlates
incoming students’ SAT scores with
subsequent grade-point averages, the
predictive validity of the SATs for nursing
school performance is being evaluated.
 Concurrent validity refers to an instrument’s
ability to distinguish individuals who differ
on a present criterion. For example, a
psychological test to differentiate between
those patients in a mental institution who
can and cannot be released could be
correlated with current behavioral ratings of
health care personnel.
 The difference between predictive and
concurrent validity, then, is the difference in
the timing of obtaining measurements on a
criterion.
 Criterion-related validity is helpful in
assisting decision makers by giving them
some assurance that their decisions will be
effective, fair, and, in short, valid.
 Validating an instrument in terms of
construct validity is a challenging task. The
key construct validity questions are: What is
this instrument really measuring?
 Does it adequately measure the abstract
concept of interest?
 Unfortunately, the more abstract the
concept, the more difficult it is to establish
construct validity; at the same time, the
more abstract the concept, the less suitable
it is to rely on criterion related validity.
 Actually, it is really not just a question of
suitability: What objective criterion is there
for such concepts as empathy, role conflict,
or separation anxiety?
 Despite the difficulty of construct validation,
it is an activity vital to the development of a
 strong evidence base. The constructs in
which nurse researchers are interested must
be validly measured.
 Construct validity is inextricably linked with
theoretical factors.
 In validating a measure of death anxiety, we
would be less concerned with the adequate
sampling of items or with its relationship to a
criterion than with its correspondence to a
cogent conceptualization of death anxiety.
 Construct validation can be approached in
several ways, but it always involves logical
analysis and tests predicted by theoretical
considerations.
 Constructs are explicated in terms of other
abstract concepts; researchers make
predictions about the manner in which the
target construct will function in relation to
other constructs.
 One construct validation approach is the
known-groups technique.
 In this procedure, the instrument is
administered to groups expected to differ on
the critical attribute because of some known
characteristic.
 For instance, in validating a measure of fear
of the labor experience, we could contrast
the scores of primiparas and multiparas.
 We would expect that women who had never
given birth would be more anxious than
women who had done so, and so we might
question the instrument’s validity if such
differences did not emerge.
 We would not necessarily expect large
differences; some primiparas would feel
little anxiety, and some multiparas would
express some fears. On the whole, however,
we would anticipate differences in average
group scores.
 In one study, for example, researchers validated a scale by comparing the scores of labor and delivery nurses with those of nurses who worked in postpartum care; they found significantly higher scores among the first group, as predicted.
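As a sketch of how a known-groups comparison could be carried out (the fear-of-labor scores, group sizes, and values below are invented), one could compare the two group means with an independent-samples t-test:

```python
# Known-groups check (hypothetical data): do primiparas score higher on a
# fear-of-labor scale than multiparas, as theory predicts?
import numpy as np
from scipy.stats import ttest_ind

primipara_scores = np.array([28, 31, 25, 34, 29, 27, 33, 30])  # expected to score higher
multipara_scores = np.array([21, 24, 19, 26, 22, 20, 25, 23])

t_stat, p_value = ttest_ind(primipara_scores, multipara_scores)
print(f"Mean difference: {primipara_scores.mean() - multipara_scores.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A difference in the predicted direction supports construct validity;
# its absence would call the instrument's validity into question.
```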
 Convergent Validity
 Measuring a concept with different methods. If the different methods yield the same results, then convergent validity is supported.
 Discriminant (Divergent) Validity
 Measuring a concept so as to discriminate it from other closely related concepts.
 E.g., measuring maternalism and paternalism as distinct concepts.
 The reliability of a quantitative instrument is
a major criterion for assessing its quality and
adequacy.
 An instrument’s reliability is the consistency
with which it measures the target attribute.
 In statistics or measurement theory, a
measurement or test is considered reliable if
it produces consistent results over repeated
testings.
 If a scale weighed a person at 120 pounds
one minute and 150 pounds the next, we
would consider it unreliable.
 The less variation an instrument produces in
repeated measurements, the higher its
reliability.
 Thus, reliability can be equated with a
measure’s stability, consistency, or
dependability.
 Reliability also concerns a measure’s
accuracy.
 An instrument is reliable to the extent that
its measures reflect true scores—that is, to
the extent that errors of measurement are
absent from obtained scores.
 A reliable measure maximizes the true score
component and minimizes the error
component.
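The slides describe this verbally; in classical test theory notation (a standard formalization, not given in the source), the same idea can be written as:

$$ X = T + E, \qquad \text{reliability} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} $$

where X is an obtained score, T the true score, and E the error of measurement; a reliable measure is one whose error variance is small relative to its true-score variance.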
 These two ways of explaining reliability
(consistency and accuracy) are not so
different as they might appear.
 Errors of measurement that impinge on an
instrument’s accuracy also affect its
consistency.
 The example of the scale with variable
weight readings illustrates this point.
 Suppose that the true weight of a person is 125 pounds, but that two independent measurements yielded 120 and 150 pounds. We could express the measurements as follows:
 120 − 125 = −5
 150 − 125 = +25
 The errors of measurement for the two trials (−5 and +25, respectively) resulted in scores that are inconsistent and inaccurate.
 The reliability of an instrument can be
assessed in various ways.
 The method chosen depends on the nature of
the instrument and on the aspect of
reliability of greatest concern.
 Three key aspects are stability, internal
consistency, and equivalence.
 The stability of an instrument is the extent
to which similar results are obtained on two
separate administrations.
 The reliability estimate focuses on the
instrument’s susceptibility to extraneous
factors over time, such as subject fatigue or
environmental conditions. Assessments of an
instrument’s stability involve procedures that
evaluate test–retest reliability.
 Researchers administer the same measure to
a sample on two occasions and then compare
the scores. The comparison is performed
objectively by computing a reliability
coefficient, which is a numeric index of the
magnitude of the test’s reliability.
 To explain a reliability coefficient, we must
briefly discuss a statistic known as the
correlation coefficient.
 We have pointed out repeatedly that
researchers strive to detect and explain
relationships among phenomena:
 Is there a relationship between patients’
gastric acidity levels and exposure to stress?
 Is there a relationship between body
temperature and physical exertion?
 The correlation coefficient is a tool for
quantitatively describing the magnitude and
direction of a relationship between two
variables.
 For our purposes, it is more important to understand how to read a correlation coefficient than to know how one is computed.
 Two variables that are obviously related are
people’s height and weight.
 Tall people tend to be heavier than short
people.
 We would say that there was a perfect
relationship if the tallest person in a
population were the heaviest, the second
tallest person were the second heaviest, and
so forth.
 Correlation coefficients summarize how perfect relationships are. The possible values for a correlation coefficient range from −1.00 through .00 to +1.00. If height and weight were perfectly correlated, the correlation coefficient expressing this relationship would be 1.00.
 Because the relationship does exist but is
not perfect, the correlation coefficient is
typically in the vicinity of .50 or .60.
 The relationship between height and weight
can be described as a positive relationship
because increases in height tend to be
associated with increases in weight.
 When two variables are totally unrelated,
the correlation coefficient equals zero.
 One might expect that women’s dress sizes
are unrelated to their intelligence. Large
women are as likely to perform well on IQ
tests as small women.
 The correlation coefficient summarizing such a relationship would presumably be in the vicinity of .00. Correlation coefficients running from .00 to −1.00 express inverse or negative relationships.
 When two variables are inversely related,
increases in one variable are associated with
decreases in the second variable.
 With test–retest reliability, an instrument is
administered twice to the same sample.
 Suppose we wanted to assess the stability of
a self-esteem scale. Self-esteem is a fairly
stable attribute that does not fluctuate much
from day to day, so we would expect a
reliable measure of it to yield consistent
scores on two occasions.
 To check the instrument’s stability, we
administer the scale 3 weeks apart to a
sample of 10 people.
 The differences in scores on the two testings
are not large. The reliability coefficient for
test–retest estimates is the correlation
coefficient between the two sets of scores.
In this example, the computed reliability
coefficient is .95, which is high.
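As a minimal sketch of the computation (the ten pairs of scores below are invented and are not the data behind the .95 figure above), the test–retest coefficient is simply the correlation between the two administrations:

```python
# Hypothetical test-retest data: self-esteem scores for 10 people, 3 weeks apart.
import numpy as np

time1 = np.array([55, 49, 78, 37, 44, 50, 58, 62, 48, 67])
time2 = np.array([57, 46, 74, 35, 46, 56, 55, 66, 50, 63])

# The test-retest reliability coefficient is the correlation between the two sets of scores.
reliability = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {reliability:.2f}")
```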
 In practice, reliability coefficients normally
range between .00 and 1.00. The higher the
coefficient, the more stable the measure.
Reliability coefficients above .70 usually are
considered satisfactory.
 In some situations, a higher coefficient may
be required, or a lower one may be
acceptable.
 The test–retest method is a relatively easy
approach to estimating reliability, and can be
used with self-report, observational, and
physiologic measures.
 Disadvantages
 The test–retest approach has certain
disadvantages, however. One issue is that
many traits do change over time,
independently of the measure’s stability.
Attitudes, behaviors, knowledge, physical
condition, and so forth can be modified by
experiences between testings. Test–retest
procedures confound changes from
measurement error and those from true
changes in the attribute being measured.
Still, there are many relatively enduring
attributes for which a test–retest approach is
suitable.
 Stability estimates suffer from other
problems, however. One possibility is that
subjects’ responses or observers’ coding on
the second administration will be influenced
by their memory of initial responses or
coding, regardless of the actual values the
second day.
 Such memory interference results in
spuriously high reliability coefficients.
 Another difficulty is that subjects may
actually change as a result of the first
administration.
 Finally, people may not be as careful using
the same instrument a second time.
 If they find the process boring on the second
occasion, then responses could be
haphazard, resulting in a spuriously low
estimate of stability.
 On the whole, reliability coefficients tend to
be higher for short-term retests than for
long-term retests (i.e., those greater than 1
or 2 months) because of actual changes in
the attribute being measured.
 Stability indexes are most appropriate for
relatively enduring characteristics such as
personality, abilities, or certain physical
attributes such as adult height.
 An instrument may be said to be internally
consistent or homogeneous to the extent
that its items measure the same trait.
 Internal consistency reliability is the most
widely used reliability approach among nurse
researchers.
 Its popularity reflects the fact that it is
economical (it requires only one test
administration) and is the best means of
assessing an especially important source of
measurement error in psychosocial
instruments, the sampling of items.
 The internal consistency of the subscales is
typically assessed and, if subscale scores are
summed for an overall score, the scale’s
internal consistency would also be assessed.
 One of the oldest methods for assessing
internal consistency is the split-half
technique. For this approach, items on a
scale are split into two groups and scored
independently. Scores on the two half tests
then are used to compute a correlation
coefficient.
 Let us say that the total instrument consists
of 20 questions, and so the items must be
divided into two groups of 10.
 Although many splits are possible, the usual
procedure is to use odd items versus even
items.
 The correlation coefficient for scores on the
two half-tests gives an estimate of the
scale’s internal consistency. If the odd items
are measuring the same attribute as the even
items, then the reliability coefficient should
be high.
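A brief sketch of the odd–even split (the item responses are simulated from a single latent trait, an assumption made only so the halves have something in common; the Spearman–Brown step at the end is the usual correction for halving the test length and is not discussed in the slides):

```python
# Hypothetical split-half reliability for a 20-item scale answered by 50 respondents.
import numpy as np

rng = np.random.default_rng(0)
trait = rng.normal(0, 1, size=50)                          # latent trait per respondent (simulated)
items = trait[:, None] + rng.normal(0, 1, size=(50, 20))   # 20 noisy items tapping that trait

odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...

split_half_r = np.corrcoef(odd_half, even_half)[0, 1]
# Spearman-Brown correction: estimates full-length reliability from the half-test correlation.
corrected = 2 * split_half_r / (1 + split_half_r)
print(f"Split-half r = {split_half_r:.2f}, corrected to full length = {corrected:.2f}")
```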
 The most widely used method for evaluating internal consistency is coefficient alpha (or Cronbach’s alpha). Coefficient alpha can be interpreted like other reliability coefficients described here; the normal range of values is between .00 and +1.00, and higher values reflect higher internal consistency.
 Coefficient alpha is preferable to the split-
half procedure because it gives an estimate
of the split-half correlation for all possible
ways of dividing the measure into two
halves.
 The split-half technique has been used to
estimate homogeneity, but coefficient alpha
is preferable. Neither approach considers
fluctuations over time as a source of
unreliability.
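The slides do not give the formula, but the standard definition of coefficient alpha for a k-item scale is k/(k − 1) × (1 − sum of item variances / variance of total scores). A sketch with simulated item responses:

```python
# Cronbach's alpha for a hypothetical item-response matrix (rows = respondents, columns = items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
trait = rng.normal(0, 1, size=100)
responses = trait[:, None] + rng.normal(0, 1, size=(100, 10))  # 10 simulated items tapping one trait
print(f"Coefficient alpha: {cronbach_alpha(responses):.2f}")
```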
 Nurse researchers estimate a measure’s
reliability by way of the equivalence
approach primarily with observational
measures.
 Researchers should assess the reliability of
observational instruments. In this case,
“instrument” includes both the category and
rating system and the observers making the
measurements.
 Interrater (or interobserver) reliability is
estimated by having two or more trained
observers watching an event simultaneously,
and independently recording data according
to the instrument’s instructions.
 The data can then be used to compute an
index of equivalence or agreement between
observers. For certain types of observational
data (e.g., ratings), correlation techniques
are suitable.
 That is, a correlation coefficient is computed
to demonstrate the strength of the
relationship between one observer’s ratings
and another’s.
 Another procedure is to compute reliability as a function of agreements, using the following equation:
 Number of agreements ÷ (Number of agreements + Number of disagreements)
 This simple formula unfortunately tends to
overestimate observer agreements. If the
behavior under investigation is one that
observers code for absence or presence
every, say, 10 seconds, the observers will
agree 50% of the time by chance alone.
 Other approaches to estimating interrater
reliability may be of interest to advanced
students. Techniques such as Cohen’s kappa,
analysis of variance, intraclass correlations,
and rank-order correlations have been used
to assess interobserver reliability.
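A short sketch (hypothetical codings for two observers, 1 = behavior present, 0 = absent) showing both the simple agreement proportion from the equation above and Cohen's kappa, which corrects for the chance agreement the slides warn about:

```python
# Interrater agreement for two observers coding presence (1) / absence (0) of a behavior.
import numpy as np

observer_a = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
observer_b = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0])

# Proportion agreement: number of agreements / (agreements + disagreements).
p_observed = np.mean(observer_a == observer_b)

# Chance agreement expected from each observer's marginal coding rates.
p_a1, p_b1 = observer_a.mean(), observer_b.mean()
p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Proportion agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")
```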
 Interpretation of Reliability Coefficients
 Reliability coefficients are important
indicators of an instrument’s quality.
 Unreliable measures do not provide adequate
tests of researchers’ hypotheses.
 If data fail to confirm a prediction, one possibility is that the instruments were unreliable, not necessarily that the expected relationships do not exist.
 Knowledge about an instrument’s reliability
thus is critical in interpreting research
results, especially if research hypotheses are
not supported.
 For group-level comparisons, coefficients in
the vicinity of .70 are usually adequate,
although coefficients of .80 or greater are
highly desirable.
 By group-level comparisons, we mean that
researchers compare scores of groups, such
as male versus female or experimental versus
control subjects.
 If measures are used for making decisions
about individuals, then reliability
coefficients ideally should be .90 or better.
 For instance, if a test score was used as a
criterion for admission to a graduate nursing
program, then the accuracy of the test would
be of critical importance to both individual
applicants and the school of nursing.
 In general, items that elicit a 50/50 split (e.g., agree/disagree or correct/incorrect) have the best discriminating power. As a general guideline, if the split is 80/20 or worse, the item should probably be replaced.
 Another aspect of an item analysis is an
inspection of the correlations between
individual items and the overall scale score.
Item-to-total correlations below .30 are
usually considered unacceptably low.
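A sketch of this item-analysis step (simulated responses; the corrected item-total correlation used here, which correlates each item with the total of the remaining items, is a common refinement and an assumption rather than something the slides specify):

```python
# Item-total correlations for a hypothetical 8-item scale answered by 80 respondents.
import numpy as np

rng = np.random.default_rng(2)
trait = rng.normal(0, 1, size=80)
items = trait[:, None] + rng.normal(0, 1.5, size=(80, 8))  # simulated item responses

total = items.sum(axis=1)
for i in range(items.shape[1]):
    rest = total - items[:, i]                  # total score excluding the item itself
    r = np.corrcoef(items[:, i], rest)[0, 1]
    flag = "  <-- below .30, candidate for revision" if r < 0.30 else ""
    print(f"Item {i + 1}: item-total r = {r:.2f}{flag}")
```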
 The extent to which a test measures what it
was designed to measure.
 Agreement between a test score or measure
and the quality it is believed to measure.
 Proliferation of definitions led to a dilution
of the meaning of the word into all kinds of
“validities”
 Internal validity – Cause and effect in
experimentation; high levels of control;
elimination of confounding variables
 External validity - to what extent one may safely
generalize the (internally valid) causal inference
(a) from the sample studied to the defined
target population and (b) to other populations
(i.e. across time and space). Generalize to other
people
 Population validity – can the sample results be
generalized to the target population
 Ecological validity - whether the results can be
applied to real life situations. Generalize to other
(real) situations
 Content validity – when trying to measure a
domain are all sub-domains represented
 When measuring depression are all 16 clinical
criteria represented in the items
 Very complementary to domain sampling theory and reliability
 However, often high levels of content validity
will lead to lower internal consistency reliability
 Construct validity – overall are you measuring
what you are intending to measure
 Intentional validity – are you measuring what you
are intending and not something else. Requires
that constructs be specific enough to
differentiate
 Representation validity or translation validity – how well have the constructs been translated into measurable outcomes. Validity of the operational definitions
 Face validity – Does a test “appear” to be
measuring the content of interest. Do questions
about depression have the words “sad” or
“depressed” in them
 Construct Validity
 Observation validity – how good are the measures
themselves. Akin to reliability
 Convergent validity - Convergent validity refers
to the degree to which a measure is correlated
with other measures that it is theoretically
predicted to correlate with.
 Discriminant validity - Discriminant validity describes the degree to which the operationalization does not correlate with other operationalizations that it theoretically should not correlate with.
 Criterion-Related Validity - the success of
measures used for prediction or estimation.
There are two types:
 Concurrent validity - the degree to which a test correlates with an external criterion that is measured at the same time (e.g. does a depression inventory correlate with clinical diagnoses)
 Predictive validity - the degree to which a test predicts (correlates with) an external criterion that is measured some time in the future (e.g. does a depression inventory score predict later clinical diagnosis)
 Social validity – refers to the social importance
and acceptability of a measure
 There is a total mess of “validities” and their
definitions, what to do?
 1985 - Joint Committee of
 AERA: American Educational Research Association
 APA: American Psychological Association
 NCME: National Council on Measurement in
Education
 developed Standards for Educational and
Psychological Testing (revised in 1999).
 According to the Joint Committee:
 Validity is the evidence for inferences made
about a test score.
 Three types of evidence:
 Content-related
 Criterion-related
 Construct-related
 Different from the notion of “different types
of validity”
 Content-related evidence (Content Validity)
 Based upon an analysis of the body of knowledge
surveyed.
 Criterion-related evidence (Criterion
Validity)
 Based upon the relationship between scores on a
particular test and performance or abilities on a
second measure (or in real life).
 Construct-related evidence (Construct
Validity)
 Based upon an investigation of the psychological
constructs or characteristics of the test.
 Face Validity
 The mere appearance that a test has validity.
 Does the test look like it measures what it is
supposed to measure?
 Do the items seem to be reasonably related to the perceived purpose of the test?
 Does a depression inventory ask questions
about being sad?
 Not a “real” measure of validity, but one that is
commonly seen in the literature.
 Not considered a legitimate form of validity by the Joint Committee.
 Does the test adequately sample the content
or behavior domain that it is designed to
measure?
 If items are not a good sample, results of
testing will be misleading.
 Usually developed during test development.
 Not generally empirically evaluated.
 Judgment of subject matter experts.
 To develop a test with high content-related
evidence of validity, you need:
 good logic
 intuitive skills
 Perseverance
 Must consider:
 wording
 reading level
 Other content-related evidence terms
 Construct underrepresentation: failure to
capture important components of a construct.
 Test is designed for chapters 1-10 but only chapters 1-8 show up on the test.
 Construct-irrelevant variance: occurs when
scores are influenced by factors irrelevant to the
construct.
 Test is well-intentioned, but problems secondary to
the test negatively influence the results (e.g., reading
level, vocabulary, unmeasured secondary domains)
 Tells us how well a test corresponds with a
particular criterion
 criterion: behavioral or measurable outcome
 SAT predicting GPA (GPA is criterion)
 BDI scores predicting suicidality (suicide is
criterion).
 Used to “predict the future” or “predict the
present.”
 Predictive Validity Evidence
 forecasting the future
 how well does a test predict future outcomes
 SAT predicting 1st yr GPA
 most tests don’t have great predictive validity
 validity estimates decrease due to time & method variance
 Concurrent Validity Evidence
 forecasting the present
 how well does a test predict current similar
outcomes
 job samples, alternative tests used to
demonstrate concurrent validity evidence
 generally higher than predictive validity
estimates
 Validity Coefficient
 correlation between the test and the criterion
 usually between .30 and .60 in real life.
 In general, as long as they are statistically
significant, evidence is considered valid.
 However,
 recall that r² indicates explained variance.
 SO, in reality, we are only looking at explained
criterion variance in the range of 9 to 36%.
 Sound Problematic??
 Look for changes in the cause of relationships.
(third variable effect)
 E.g. Situational factors during validation that are
replicated in later uses of the scale
 Examine what the criterion really means.
 Optimally the criterion should be something the
test is trying to measure
 If the criterion is not valid and reliable, you have
no evidence of criterion-related validity!
 Review the subject population in the validity
study.
 If the normative sample is not representative, you
have little evidence of criterion-related validity.
 Ensure the sample size in the validity study was
adequate.
 Never confuse the criterion with the predictor.
 GREs are used to predict success in grad school
 Some grad programs may admit low GRE students
but then require a certain GRE before they can
graduate.
 So students admitted with low GRE scores end up succeeding, which seems to demonstrate poor predictive validity!
 But the process was dumb to begin with: the criterion has been entangled with the predictor…
 Watch for restricted ranges.
 Review evidence for validity generalization.
 Tests only given in laboratory settings, then
expected to demonstrate validity in classrooms?
 Ecological validity?
 Consider differential prediction.
 Just because a test has good predictive validity
for the normative sample may not ensure good
predictive validity for people outside the
normative sample.
 External validity?
 Construct: something constructed by mental
synthesis
 What is Intelligence? Love? Depression?
 Construct Validity Evidence
 assembling evidence about what a test means
(and what it doesn’t)
 sequential process; generally takes several
studies
 Convergent Evidence
 obtained when a measure correlates well with
other tests believed to measure the same
construct.
 Self-report, collateral-report measures
 Discriminant Evidence
 obtained when a measure correlates less strongly with other tests believed to measure something slightly different
 This does not mean any old test that you know
won’t correlate; should be something that could be
related but you want to show is separate
 Example: IQ and Achievement Tests
 Standard Error of Estimate:
 s_est is the standard error of estimate, s_Y is the standard deviation of the criterion scores, and r is the validity of the test.
 Essentially, this is regression all over again:
 $$ s_{est} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = s_Y \sqrt{(1 - r^2)\,\frac{N - 1}{N - 2}} $$
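A quick numeric check of the formula, with invented values (criterion SD of 10, validity coefficient of .60, 50 score pairs):

```python
# Hypothetical worked example of the standard error of estimate.
import math

s_y, r, n = 10.0, 0.60, 50
s_est = s_y * math.sqrt((1 - r**2) * (n - 1) / (n - 2))
print(f"Standard error of estimate = {s_est:.2f}")  # about 8.1 criterion-score points
```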
 Maximum Validity depends on Reliability:
 $$ r_{12\max} = \sqrt{r_1 \, r_2} $$
 r_12max is the maximum validity (the highest correlation the test can have with the criterion)
 r_1 is the reliability of the test
 r_2 is the reliability of the criterion
Reliability of Test    Reliability of Criterion    Maximum Validity (Correlation)
1 1 1.00
0.8 1 0.89
0.6 1 0.77
0.4 1 0.63
0.2 1 0.45
0 1 0.00
1 0.5 0.71
0.8 0.5 0.63
0.6 0.5 0.55
0.4 0.5 0.45
0.2 0.5 0.32
0 0.5 0.00
1 0.2 0.45
0.8 0.2 0.40
0.6 0.2 0.35
0.4 0.2 0.28
0.2 0.2 0.20
0 0.2 0.00
1 0 0.00
0.8 0 0.00
0.6 0 0.00
0.4 0 0.00
0.2 0 0.00
0 0 0.00
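The table values follow directly from the formula above; a brief sketch that reproduces them:

```python
# Reproduce the maximum-validity table: r_12max = sqrt(r_test * r_criterion).
import math

for r_criterion in (1.0, 0.5, 0.2, 0.0):
    for r_test in (1.0, 0.8, 0.6, 0.4, 0.2, 0.0):
        max_validity = math.sqrt(r_test * r_criterion)
        print(f"test reliability {r_test:.1f}, criterion reliability {r_criterion:.1f}"
              f" -> maximum validity {max_validity:.2f}")
```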