Reliability

Reliability is fundamental to all aspects of measurement (the consistency of the actual measuring instrument or procedure), because without it we cannot have confidence in the data we collect, nor can we draw rational conclusions from those data.
Types of Reliability

1. Test-retest reliability (stability)

We administer the same test to the same sample on two different occasions and see whether the two administrations yield the same result.
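As a minimal sketch of how test-retest reliability is often quantified, the hypothetical scores below are correlated across the two administrations with Pearson's r (one common index; the intraclass correlation discussed at the end of this section is generally preferred for continuous measures):

```python
# Minimal sketch: test-retest reliability as a correlation between
# two administrations of the same test. Scores are hypothetical.
from scipy.stats import pearsonr

session_1 = [12.1, 14.3, 9.8, 16.0, 11.5, 13.2]  # first administration
session_2 = [12.4, 14.0, 10.1, 15.6, 11.9, 13.5]  # same subjects, later date

r, p_value = pearsonr(session_1, session_2)
print(f"test-retest r = {r:.2f} (p = {p_value:.3f})")
# r near +1 suggests the instrument yields stable results over time.
```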
♦ Test-retest reliability

- Test-retest intervals

Because the stability of a response variable is such a significant factor, the time interval between tests must be considered carefully. Intervals should be far enough apart to avoid fatigue, learning, or memory effects, but close enough to avoid genuine changes in the measured variable.
- Carryover and testing effects

With two or more measures, reliability can be influenced by the effect of the first test on the outcome of the second test. For example, practice or carryover effects can occur with repeated measurements, changing performance on subsequent trials.
Types of Reliability (cont.)

2. Rater reliability (reproducibility)

Many clinical measurements require that a human observer, or rater, be part of the measurement system. In some cases, the rater is the actual measuring instrument.
♦ Rater reliability

- Inter-rater reliability

Used to assess the degree to which different raters or observers give consistent estimates of the same phenomenon.
♦ Rater reliability (cont.)

- Intra-rater reliability

The stability of data recorded by one individual across two or more trials. In a test-retest situation, intra-rater reliability and test-retest reliability are essentially the same estimate.
Types of Reliability (cont.)

3. Alternate forms (parallel forms) reliability

Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
Many measuring instruments exist in two or more versions, called equivalent, parallel, or alternate forms. Interchange of these alternate forms can be supported only by establishing their parallel reliability.

Alternate-forms reliability testing is often used as an alternative to test-retest reliability with paper-and-pencil tests, when the nature of the test is such that subjects are likely to recall their responses to test items.

Examples: the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE).
Validity

• The ability of a scale to measure what it is intended to measure (the study's success at measuring what the researchers set out to measure)
• The extent to which a measure reflects the real meaning of the concept under consideration
• The extent to which a measure reflects the opinions and behaviors of the population under investigation
• A measure cannot be valid unless it is also reliable
Validity (cont.)

• Depends on the purpose of the measure
  - e.g., a ruler may be a valid measuring device for length, but is not very valid for measuring volume
• Measuring what "it" is supposed to measure
• Must be inferred from evidence; cannot be directly measured
Types of Validity

1. Face validity
2. Content validity
3. Pragmatic (criterion-related) validity
   A. Concurrent validity
   B. Predictive validity
4. Construct validity
   A. Convergent validity
   B. Discriminant validity
1. Face Validity

Indicates that an instrument appears to test what it is supposed to; it is the weakest form of measurement validity. This type of validity relies basically upon the subjective judgment of the researcher.
Face Validity (cont.)

It asks two questions, which the researcher must finally answer according to his or her best judgment:
(1) Is the instrument measuring what it is supposed to measure?
(2) Is the sample being measured adequate to be representative of the behavior or trait being measured?
2. Content Validity

Indicates that the items that make up an instrument adequately sample the universe of content that defines the variable being measured; the instrument therefore contains all the elements that reflect the variable being studied. It is most useful with questionnaires, examinations, and inventories. This type of validity is sometimes equated with face validity.
Content Validity (cont.)

For example, if we are interested in the content validity of questions being asked to elicit familiarity with a certain area of knowledge, content validity would be concerned with how accurately the questions asked tend to elicit the information sought.

There are no statistical indices that can assess content validity. Claims for content validation are made by a panel of "experts" who review the instrument and determine whether the questions satisfy the content domain. This process often requires several revisions of the test. When all agree that the content domain has been sampled adequately, content validity is supported.
3. Criterion-Related (Pragmatic) Validity

This is the most practical and objective approach to validity testing. It is based on the ability of one test to predict results obtained on an external criterion. The test to be validated, called the target test, is compared with a gold standard or criterion measure that is already established or assumed to be valid.

For example, we can investigate the validity of heart rate (the target test) as an indicator of energy cost during exercise by correlating it with values obtained in standardized oxygen consumption studies (the criterion measure).
1. Concurrent validity
isstudied when the measurement to be
validated and the criterion measure are taken at
relatively the same time (concurrently), so that they
both reflect the same incident of behavior.
Concurrent validity is also useful in situations
where a new or untested tool is potentially more
efficient, easier to administer, more practical, or safer
than another more established method, and is being
proposed as an alternative.
• e.g., Does a new version of an IQ test more efficient
than the past versions?
3. Criterion-Related Validity (cont.)
21.
3. Criterion-Related Validity (cont.)

B. Predictive validity

Predictive validity attempts to establish that a measure will be a valid predictor of some future criterion score. To assess predictive validity, a target test is given at one session and is followed by a period of time, after which the criterion score is obtained.

• e.g., Scholastic Aptitude Test (SAT) scores: do they predict college GPA?
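As a sketch of how predictive validity might be checked for the SAT-GPA example, the fabricated numbers below fit a simple regression of the later criterion on the earlier target test (using scipy.stats.linregress; all data are illustrative assumptions):

```python
# Sketch: predictive validity of SAT (target test, time 1) for
# college GPA (criterion, time 2). All numbers are fabricated.
from scipy.stats import linregress

sat = [1050, 1210, 980, 1340, 1120, 1260, 1430, 1010]
gpa = [2.8, 3.2, 2.5, 3.6, 3.0, 3.3, 3.8, 2.7]

fit = linregress(sat, gpa)
print(f"r = {fit.rvalue:.2f}; GPA ~ {fit.slope:.4f} * SAT + {fit.intercept:.2f}")
# A strong positive r would support using SAT scores to predict GPA.
```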
4. Construct Validity

Construct validity reflects the ability of an instrument to measure an abstract concept, or construct. A construct is any intangible concept, such as honesty, that cannot be directly observed or isolated. Construct validation is interested in the degree to which the construct itself is actually measured.
4. Construct Validity (cont.)

♦ Assessing construct validity:
– Convergent validity
– Discriminant (divergent) validity

The construct validity of a test can be evaluated in terms of how its measures relate to other tests of the same and different constructs. In other words, it is important to determine what a test does measure as well as what it does not measure.
Convergent Validity

♦ Convergent validity:
– Measuring the same concept with very different methods
– If different methods yield the same results, then convergent validity is supported
– e.g., different survey items used to measure decision-making style, both closed- and open-ended
  • Code for decision-making style from open-ended responses
  • High score on the scale = more compensatory responses
We theorize that all four items reflect the idea of self-esteem (this is why the top part of the figure is labeled Theory). On the bottom part of the figure (Observation) we see the intercorrelations of the four scale items, which might be based on giving our scale out to a sample of respondents. You should readily see that the item intercorrelations for all item pairings are very high (remember that correlations range from -1.00 to +1.00). This provides evidence supporting our theory that all four items are related to the same construct.
Discriminant (Divergent) Validity

♦ Discriminant validity:
Indicates that different results, or low correlations, are expected from measures that are believed to assess different characteristics. To establish discriminant validity, you need to show that measures that should not be related are in reality not related.

Ex.: the results of an intelligence test should not be expected to correlate with the results of a test of gross motor skill.
There are four correlations between measures that reflect different constructs, and these are shown on the bottom of the figure (Observation). You should see immediately that these four cross-construct correlations are very low (i.e., near zero) and certainly much lower than the convergent correlations in the previous figure. These correlations provide evidence that the two sets of measures are discriminated from each other.
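The logic of both figures can be illustrated with simulated data: items written for the same construct should intercorrelate highly (convergent), while items written for different constructs should correlate near zero (discriminant). The constructs, item loadings, and sample size below are arbitrary assumptions:

```python
# Sketch: convergent vs. discriminant correlations on simulated items.
import numpy as np

rng = np.random.default_rng(0)
n = 200
self_esteem = rng.normal(size=n)    # latent construct 1
motor_skill = rng.normal(size=n)    # latent construct 2 (unrelated)

# Four self-esteem items and two motor-skill items: latent score + noise.
items = np.column_stack(
    [self_esteem + 0.4 * rng.normal(size=n) for _ in range(4)]
    + [motor_skill + 0.4 * rng.normal(size=n) for _ in range(2)]
)
r = np.corrcoef(items, rowvar=False)
print(f"convergent (same construct) r ~ {r[0, 1]:.2f}")     # expected high
print(f"discriminant (cross construct) r ~ {r[0, 4]:.2f}")  # expected near 0
```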
Validity & Research Design

♦ Internal
– Controlling for other factors in the design
  • Validity of structure, sampling, measures, procedures
  • Claims regarding what happened in the study

♦ External
– Looking beyond the design to other cases
  • Validity of inferences made from the conclusions
  • Claims regarding what happens in the real world
Threats to Validity

Threats to Internal Validity

1. History

History is a plausible explanation for experimental effects when some outside event that affects outcome measures occurs during the course of a study. In any study that involves an evaluation, treatment, and further evaluation, some subjects will undergo events that may affect the final evaluation.

For example, in a study of the effects of an exercise programme on hypertension, some of the patients might take up additional exercise, such as playing tennis.
2. Maturation

Maturation concerns not events but the mere passage of time, and therefore may be of particularly serious concern in health science research. Maturation refers to time-dependent internal changes in subjects. As children grow older, certain skills appear and develop independent of treatment or training.
3. Testing

If the same instrument is employed in pre-treatment and post-treatment evaluations, subjects are more familiar with the instrument in the post-treatment situation and usually score higher. These are sometimes called practice effects.
4. Instrumentation

If the instrument of measurement is a human observer, the instrument itself may exhibit practice effects: as they gain experience, raters tend to perceive phenomena differently.

Mechanical instrumentation, such as measurement scales and even physical equipment, may represent an instrumentation threat to validity either if the device is differentially sensitive at various levels along the range of measurement or if the equipment itself requires (and does not receive) appropriate adjustment and recalibration.
5. Statistical Regression (Regression to the Mean)

This is a special effect that originates from the unreliability of test measures. Often, clinical research involves selecting patients for study who have achieved particularly low or high scores on one or more measures, for example the most depressed patients or those with the highest measured cholesterol levels. If you test these people again, regardless of what you do to them in the interim, their scores will tend to level off, moving in the direction of the "average" score. Thus, high initial scores will tend to drop, low initial scores will tend to increase, and moderate scores will tend to change very little. This happens because on the second measurement the measurement error tends to be less extreme.
5. Statistical Regression (cont.)

If subjects are chosen on the basis of very good or very poor performance on an initial assessment, they are likely to include a number of cases in which measurement error is quite high, and proportionately few with small measurement error. In such cases the regression-to-the-mean phenomenon is a possibility, as on the second measurement there is likely to be less extreme measurement error, on average. The simulation below illustrates this.
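A small simulation (the trait mean, spread, and error sizes are arbitrary) makes the mechanism concrete: subjects selected for extreme first-test scores drift back toward the mean on retest, with no intervention at all:

```python
# Sketch: regression to the mean after selecting extreme scorers.
import numpy as np

rng = np.random.default_rng(1)
true_score = rng.normal(100, 10, size=10_000)          # stable trait
test_1 = true_score + rng.normal(0, 10, size=10_000)   # noisy measure
test_2 = true_score + rng.normal(0, 10, size=10_000)   # retest, fresh error

high = test_1 > np.percentile(test_1, 90)  # pick the top 10% on test 1
print(f"selected group, test 1 mean: {test_1[high].mean():.1f}")
print(f"selected group, test 2 mean: {test_2[high].mean():.1f}")
# The group "improves" toward 100 on retest purely because the large
# positive errors that got them selected are unlikely to recur.
```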
6. Selection or Assignment Errors

Selection concerns the way in which subjects are placed into experimental groups. The groups being compared may, due to bad assignment or selection procedures, be different at the outset, rather than as a result of any treatment effects. This might well happen if the subjects were not randomly assigned to treatment groups. Appropriate sampling procedures help to ensure against this threat to validity.
7. Mortality

Mortality deals with the loss of subjects from the research over time for various reasons. It is particularly important for the researcher to note, whenever possible, the reasons for the defection of subjects and to pay especially close attention to attrition rates across experimental conditions. As the subjects who drop out might differ from those who stay, the experimental and control groups no longer remain equivalent.
9. Selection-History

Selection-history is particularly a threat when experimental groups are geographically separated, in different states or even different hospitals. An event of local importance may well affect only one of your groups.
10. Selection-Instrumentation

Selection-instrumentation occurs when experimental groups' average scores differ on an instrument whose intervals are not equal across the entire range of the measure. An example would be a thermometer that is more sensitive below 100 degrees than above 200 degrees.
Threats to External Validity

1. Interaction of selection and treatment

A particular threat to external validity is the use of volunteer subjects. Volunteers often have characteristics that differ from those of the general population.
2. Interaction of setting and treatment

This threat involves the extent to which the experimental setting resembles the setting in which treatment will be received in post-experimental situations. The problem is less severe in the health sciences than in the behavioral and social sciences, where college students often serve as subjects for research intended to be generalized to industrial or other non-academic settings. The control and similarity of medical care facilities minimize this threat for health science professionals.
3. Interaction of history and treatment

This threat involves differences in the ways that various treatment groups may be affected by events occurring during the course of the research. For example, the treatment group might be affected by an outside event while the control group remains unaffected.

At this point you may be wondering how researchers ever manage to design a completely valid study. The truth is that no one ever does. There is no perfect study, only studies containing degrees of imperfection. This is why studies must be replicated in other settings, by other researchers, using other operations and methods. Good researchers are constantly aware of the pitfalls of study design presented by threats to validity and try to avoid as many of them as possible.
Reliability and Validity

Reliability
• How accurate or consistent is the measure?
• Would two people understand a question in the same way?
• Would the same person give the same answers under similar circumstances?

Validity
• Does the concept measure what it is intended to measure?
• Does the measure actually reflect the concept?
• Do the findings reflect the opinions, attitudes, and behaviors of the target population?
Example

Suppose that you have a bathroom weight scale and this scale is broken. The scale represents the methodology. One person weighs you with this scale and obtains a result. The scale is then passed along to another person, who follows the same procedure and weighs you with the same broken scale. The two people, using the same broken scale, arrive at similar measures: the results are reliable. Although the results are reliable, they may not be valid; that is, because the scale is faulty, the results are not a true indicator of your real weight.
The Reliability and Validity Relationship

• An instrument that is valid is always reliable
• An instrument that is not valid may or may not be reliable
• An instrument that is reliable may or may not be valid
Bias (systematic error)
- deviation from truth in one direction
- "dirty dirt"

Random error (random variation, chance, noise)
- deviation from truth in both directions, i.e., above and below
- the mean of repeated measurements equals the population's true value
- "clean dirt"
- cannot be totally eliminated
- can be estimated and reduced by statistics

In most situations, both bias and chance are present:

Measured value = True value + Systematic error + Random error
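A brief simulation of the equation above (the bias and noise magnitudes are arbitrary illustrations) shows why statistics can reduce random error but not systematic error:

```python
# Sketch: Measured = True + Systematic error + Random error.
import numpy as np

rng = np.random.default_rng(2)
true_value = 70.0                       # e.g., true weight in kg
bias = 1.5                              # systematic error ("dirty dirt")
noise = rng.normal(0, 0.8, size=1000)   # random error ("clean dirt")

measured = true_value + bias + noise
print(f"mean of 1000 measurements: {measured.mean():.2f}")
# Averaging drives the random error toward zero, so the mean converges
# to true_value + bias (71.5), not to the true value (70.0).
```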
Sources of Measurement Error

1. The individual taking the measurements (often called the tester or rater)
2. The measuring instrument
3. Variability of the characteristic being measured
Sources of Measurement Error (cont.)

Many sources of error can be minimized through careful planning, training, clear operational definitions, and inspection of equipment. Therefore, a testing protocol should thoroughly describe the method of measurement, which must be uniformly performed across trials. Isolating and defining each element of the measure reduces the potential for error, thereby improving reliability.
1) Statistical measures of validity:

- Dichotomous (2 × 2 table): sensitivity, specificity, accuracy

                Gold Standard (Truth)
                  +ve        -ve
  Test  +ve        a          b
        -ve        c          d
  Total           a+c        b+d

  Sensitivity = a / (a + c)
  Specificity = d / (b + d)
  Accuracy = (a + d) / (a + b + c + d)

- Continuous: 1-sample (paired) t-test
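The three indices follow directly from the cell counts of the 2 × 2 table; the counts below are hypothetical:

```python
# Sketch: sensitivity, specificity, and accuracy from hypothetical counts.
a, b = 45, 10   # test +ve: true positives (a), false positives (b)
c, d = 5, 140   # test -ve: false negatives (c), true negatives (d)

sensitivity = a / (a + c)             # Pr(test +ve | truth +ve)
specificity = d / (b + d)             # Pr(test -ve | truth -ve)
accuracy = (a + d) / (a + b + c + d)  # overall proportion correct
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"accuracy={accuracy:.2f}")
```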
2) Statistical measures of reliability:

Assumptions:
- Subjects are independent
- Observers are independent
- Categories of the (nominal, ordinal) scale are independent, mutually exclusive, and exhaustive
- Nominal data
  Dichotomous (2 × 2 table): kappa
  Polychotomous (r × r table): overall kappa, individual kappa

Degree of agreement:
  K > 0.75: excellent agreement beyond chance
  0.40 ≤ K ≤ 0.75: fair to good
  K < 0.40: poor
SPSS Output — Symmetric Measures

                          Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement
  Kappa                   .245    .134                   2.498          .013
N of Valid Cases          100

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
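Kappa compares observed agreement with the agreement expected by chance from the raters' marginal totals: K = (p_o − p_e) / (1 − p_e). A sketch with hypothetical counts (not the SPSS data above):

```python
# Sketch: Cohen's kappa for two raters from a hypothetical 2 x 2 table.
import numpy as np

table = np.array([[30, 10],    # rows: rater 1 (yes/no)
                  [ 5, 55]])   # cols: rater 2 (yes/no)
n = table.sum()

p_o = np.trace(table) / n                       # observed agreement
p_e = (table.sum(0) / n) @ (table.sum(1) / n)   # chance agreement from margins
kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o={p_o:.2f}, p_e={p_e:.2f}, kappa={kappa:.2f}")  # ~0.68: fair-good
```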
- Continuous data: intraclass correlation coefficient (ICC, rI)

ICC:
1. Accommodates two or more observers
2. Each subject can have a different number of observers
3. Can be applied to ordinal data (when the intervals between categories are assumed to be equivalent)
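As a sketch, the one-way random-effects form ICC(1,1) can be computed from ANOVA mean squares; several ICC forms exist, and the appropriate one depends on the rating design. The ratings below are hypothetical:

```python
# Sketch: one-way random-effects ICC(1,1) from ANOVA mean squares,
# ICC = (MSB - MSW) / (MSB + (k - 1) * MSW). Ratings are hypothetical.
import numpy as np

ratings = np.array([[9.0, 8.5], [7.2, 7.0], [6.1, 6.6],
                    [8.8, 9.1], [5.0, 5.4]])   # 5 subjects x 2 raters
n, k = ratings.shape

subject_means = ratings.mean(axis=1)
msb = k * subject_means.var(ddof=1)                                    # between
msw = ((ratings - subject_means[:, None]) ** 2).sum() / (n * (k - 1))  # within
icc = (msb - msw) / (msb + (k - 1) * msw)
print(f"ICC(1,1) = {icc:.2f}")  # near 1.0 indicates strong reliability
```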