Validity and Reliability
Nipa Rojroongwasinkul, Ph.D.
Institute of Nutrition
Mahidol University
Reliability
Reliability is fundamental to all aspects of measurement (the consistency of the actual measuring instrument or procedure), because without it we cannot have confidence in the data we collect, nor can we draw rational conclusions from those data.
Types of Reliability
1. Test-retest reliability (Stability)
We administer the same test to the same sample on two different occasions and see whether the two administrations yield the same result.
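In practice, test-retest reliability is often summarized with a correlation between the two administrations. A minimal sketch, assuming the Pearson correlation is used as the stability coefficient; the scores are purely illustrative:

```python
import numpy as np

# Hypothetical scores from the same 8 subjects on two occasions
occasion_1 = np.array([12, 15, 11, 18, 14, 16, 13, 17])
occasion_2 = np.array([13, 14, 11, 17, 15, 16, 12, 18])

# Test-retest reliability as the Pearson correlation between occasions
r = np.corrcoef(occasion_1, occasion_2)[0, 1]
print(f"Test-retest reliability (Pearson r) = {r:.2f}")
```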
- Test-retest Intervals
Because the stability of a response
variable is such a significant factor, the time
interval between tests must be considered
carefully. Intervals should be far enough
apart to avoid fatigue, learning, or memory
effects, but close enough to avoid genuine
changes in the measured variable.
- Carryover and Testing Effects
With two or more measures, reliability can be influenced by the effect of the first test on the outcome of the second test.
For example, practice or carryover effects can
occur with repeated measurements, changing
performance on subsequent trials.
2. Rater reliability (Reproducibility)
Many clinical measurements require that
a human observer, or rater, be part of the
measurement system. In some cases, the
rater is the actual measuring instrument.
- Inter-rater reliability
Used to assess the degree to which
different raters/observers give consistent
estimates of the same phenomenon.
- Intra-rater reliability
The stability of data recorded by one
individual across two or more trials.
In a test-retest situation, intra-rater
reliability and test-retest reliability are
essentially the same estimate.
3. Alternate forms (Parallel forms) reliability
Used to assess the consistency of the
results of two tests constructed in the same
way from the same content domain.
Many measuring instruments exist in two or more versions, called equivalent, parallel, or alternate forms.
The interchangeable use of these alternate forms can be supported only by establishing their parallel-forms reliability.
Alternate forms reliability testing is often used
as an alternative to test-retest reliability with paper-
and-pencil tests, when the nature of the test is such
that subjects are likely to recall their responses to
test items.
Examples: the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE).
Validity
• The ability of a scale to measure what it is intended to measure (the study's success at measuring what the researchers set out to measure)
• The extent to which a measure reflects the real meaning of the concept under consideration
• The extent to which a measure reflects the opinions and behaviors of the population under investigation
• A measure cannot be valid unless it is also reliable
Validity
• Depends on the Purpose of the measure
- E.g. a ruler may be a valid measuring
device for length, but isn’t very valid for
measuring volume
• Measuring what ‘it’ is supposed to
• Must be inferred from evidence; cannot be
directly measured
Types of Validity
1. Face validity
2. Content validity
3. Pragmatic (criterion-related) validity
   A. Concurrent validity
   B. Predictive validity
4. Construct validity
   A. Convergent validity
   B. Discriminant validity
1. Face Validity
Indicates that an instrument appears to
test what it is supposed to; the weakest form
of measurement validity.
This type of validity relies primarily on the subjective judgment of the researcher.
Face validity asks two questions, which the researcher must ultimately answer according to his or her best judgment:
(1) Is the instrument measuring what it is
supposed to measure?
(2) Is the sample being measured
adequate to be representative of the behavior
or trait being measured?
2. Content Validity
Indicates that the items that make up an
instrument adequately sample the universe of
content that defines the variable being
measured. Therefore, the instrument does contain all the elements that reflect the variable being studied. Content validity is most useful with questionnaires, examinations, and inventories.
This type of validity is sometimes equated
with face validity.
For example, if we are interested in the content validity of questions asked to elicit familiarity with a certain area of knowledge, content validity would be concerned with how accurately the questions asked elicit the information sought.
There are no statistical indices that can assess content validity. Claims for content validation
are made by a panel of “experts” who review the
instrument and determine if the questions satisfy the
content domain. This process often requires several
revisions of the test. When all agree that the content
domain has been sampled adequately, content
validity is supported.
3. Criterion-Related (Pragmatic) Validity
Criterion-related validity is the most practical and objective approach to validity testing. It is based on the ability of one test to predict the results obtained on an external criterion. The test to be validated, called the target test, is compared with a gold standard or criterion measure that is already established or assumed to be valid.
For example, we can investigate the validity of heart rate (the target test) as an indicator of energy cost during exercise by correlating it with values obtained in standardized oxygen consumption studies (the criterion measure).
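A sketch of how such a validity coefficient might be computed for the heart-rate example, assuming a Pearson correlation between target and criterion; the paired measurements below are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations during graded exercise:
# heart rate (target test) and oxygen consumption (criterion measure)
heart_rate = np.array([95, 110, 124, 138, 152, 166, 178])  # beats/min
vo2 = np.array([1.1, 1.5, 1.9, 2.4, 2.8, 3.3, 3.6])        # L/min

# Criterion-related validity coefficient: correlation of target with criterion
r, p = stats.pearsonr(heart_rate, vo2)
print(f"Validity coefficient r = {r:.2f} (p = {p:.4f})")
```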
♦ Criterion-related validity is separated into two components:
1. Concurrent validity
2. Predictive validity
1. Concurrent validity
Concurrent validity is studied when the measurement to be validated and the criterion measure are taken at relatively the same time (concurrently), so that they both reflect the same incident of behavior.
Concurrent validity is also useful in situations where a new or untested tool is potentially more efficient, easier to administer, more practical, or safer than another more established method, and is being proposed as an alternative.
• e.g., Is a new version of an IQ test more efficient than past versions?
2. Predictive validity
Predictive validity attempts to establish that a measure will be a valid predictor of some future criterion score.
To assess predictive validity, a target test is given at one session and is followed by a period of time, after which the criterion score is obtained.
• e.g., Scholastic Aptitude Test (SAT) scores: do they predict college GPA?
4. Construct Validity
Construct validity reflects the ability of an instrument to measure an abstract concept, or construct.
A construct is an abstraction, such as honesty, that cannot be directly observed or isolated. Construct validation is concerned with the degree to which the construct itself is actually measured.
♦ Assessing construct validity:
– Convergent validity
– Discriminant (Divergent) validity
The construct validity of a test can be
evaluated in terms of how its measures relate
to other tests of the same and different
constructs. In other words, it is important to
determine what a test does measure as well as
what it does not measure.
♦ Convergent validity:
– Measuring the same concept with very
different methods
– If different methods yield the same results, then
convergent validity is supported
– e.g., different survey items used to measure decision-making style, both closed- and open-ended
• Code for decision-making style from
open-ended responses
• High score on scale = more compensatory
responses
Convergent Validity
We theorize that all four items reflect the idea of self-esteem (this is why I labeled the
top part of the figure Theory). On the bottom part of the figure (Observation) we see
the intercorrelations of the four scale items. This might be based on giving our scale
out to a sample of respondents. You should readily see that the item intercorrelations
for all item pairings are very high (remember that correlations range from -1.00 to
+1.00). This provides evidence that our theory that all four items are related to the
same construct is supported.
♦ Discriminant validity:
indicates that different results, or low
correlations, are expected from measures that
are believed to assess different characteristics.
To establish discriminant validity, you need
to show that measures that should not be
related are in reality not related.
Ex. the results of an intelligence test should
not be expected to correlate with results of a
test of gross motor skill
Discriminant (Divergent) Validity
There are four correlations between measures that reflect different
constructs, and these are shown on the bottom of the figure (Observation).
You should see immediately that these four cross-construct correlations are
very low (i.e., near zero) and certainly much lower than the convergent
correlations in the previous figure.
The correlations do provide evidence that the two sets of measures are
discriminated from each other.
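The pattern described in these two figure discussions can be reproduced with simulated data: items written for the same construct intercorrelate highly, while items for different constructs correlate near zero. A sketch with made-up constructs and arbitrary noise levels:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two unrelated latent constructs (e.g., self-esteem and gross motor skill)
self_esteem = rng.normal(size=n)
motor_skill = rng.normal(size=n)

# Three self-esteem items (convergent set) plus one motor-skill item
items = np.column_stack([
    self_esteem + rng.normal(scale=0.4, size=n),  # SE item 1
    self_esteem + rng.normal(scale=0.4, size=n),  # SE item 2
    self_esteem + rng.normal(scale=0.4, size=n),  # SE item 3
    motor_skill + rng.normal(scale=0.4, size=n),  # motor-skill item
])

# SE items intercorrelate ~0.85 (convergent); their correlations
# with the motor-skill item are near zero (discriminant).
print(np.round(np.corrcoef(items, rowvar=False), 2))
```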
Validity & Research Design
♦ Internal
– Controlling for other factors in the design
• Validity of structure, sampling, measures, procedures
• Claims regarding what happened in the study
♦ External
– Looking beyond the design to other cases
• Validity of inferences made from the conclusions
• Claims regarding what happens in the real world
Threats to Validity
Threats to Internal Validity
1. History
is a plausible explanation for experimental
effects when some outside event that affects outcome
measures occurs during the course of a study. In any
study that may involve an evaluation, treatment and
further evaluation, some subjects will undergo events
that may affect the final evaluation.
For example, in a study of the effects of an exercise
programme upon hypertension, some of the patients
might take up additional exercise, such as playing
tennis.
2. Maturation
concerns not events, but the mere
passage of time, and therefore may be of
particularly serious concern in health science
research. Maturation refers to time-
dependent internal changes in subjects.
As children grow older certain skills
appear and develop independent of treatment
or training.
3. Testing
If the same instrument is employed in
pre-treatment and post-treatment evaluations,
subjects are more familiar with the instrument
in the post-treatment situation and usually
score higher. These are sometimes called
practice effects.
4. Instrumentation
If the instrument of measurement is a
human observer, the instrument itself may
exhibit practice effects. As they gain
experience, raters tend to perceive phenomena
differently.
Mechanistic instrumentation such as
measurement scales and even physical
equipment may represent instrumentation
threats to validity either if the device is
differentially sensitive at various levels along
the range of measurement or if the equipment
itself requires (and does not receive)
appropriate adjustment and recalibration.
5. Statistical regression (Regression to the mean)
This is a special effect that originates from the unreliability of test measures. Often, clinical research involves the selection of patients for study who have achieved particularly low or high scores on one or more measures, for example the most depressed patients or those with the highest measured cholesterol levels. If you test these people again, regardless of what you do to them in the interim, their scores will tend to level off, moving in the direction of the "average" score. Thus, high initial scores will tend to drop, low initial scores will tend to increase, and moderate scores will tend to change very little. This happens because on the second measurement the measurement error tends to be less extreme.
If subjects are chosen on the basis of very good or
very poor performance on an initial assessment, they
are likely to include a number of cases in which
measurement error is quite high, and proportionately
few with small measurement error. In such cases the
regression to the mean phenomenon is a possibility,
as on the second measurement there is likely to be
less extreme measurement error, on average.
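A small simulation, with arbitrary made-up parameters, makes the effect concrete: subjects selected for extreme scores on a first test score closer to the mean on retest, even when nothing is done to them in between.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

true_score = rng.normal(100, 10, size=n)        # stable underlying trait
test1 = true_score + rng.normal(0, 5, size=n)   # first measurement + error
test2 = true_score + rng.normal(0, 5, size=n)   # retest + independent error

# Select the top 5% on the first test (e.g., highest measured cholesterol)
extreme = test1 >= np.percentile(test1, 95)

print(f"Selected group, test 1 mean: {test1[extreme].mean():.1f}")
print(f"Selected group, test 2 mean: {test2[extreme].mean():.1f}")  # closer to 100
```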
6. Selection or assignment errors
This threat concerns the way in which subjects are
placed into experimental groups. The groups
being compared may, due to bad assignment
or selection procedures, be different at the
outset, rather than as a result of any treatment
effects.
This might well happen if the subjects
were not randomly assigned into treatment
groups. Appropriate sampling procedures help
to ensure against this threat to validity.
7. Mortality
deals with the loss of subjects to the
research over time for various reasons. It is
particularly important for the researcher to
note, whenever possible, reasons for the
defection of subjects and to pay especially
close attention to attrition rates across
experimental conditions.
As the subjects who drop out might be
different to those who stay, the experimental
and the control group no longer remain
equivalent.
8. Selection-maturation
This threat arises when the various experimental groups experience maturational change at different rates.
9. Selection-history
is particularly a threat when experimental
groups are geographically separated, in different
states or even different hospitals. An event of
local importance may well affect only one of
your groups.
10. Selection-instrumentation
This threat occurs when experimental groups' average scores differ on an instrument whose intervals are not equal across the entire range of the measure.
An example would be a thermometer that was
more sensitive below 100 degrees than above
200 degrees.
Threats to External Validity
1. Interaction of selection and treatment
A particular threat to external validity is
the use of volunteer subjects. Volunteers
often have characteristics that differ from
those of the general population.
2. Interaction of setting and treatment
involves the extent to which the experimental setting resembles the setting in which treatment will be received in post-experimental situations.
This problem is less severe in the health sciences than in the behavioral and social sciences, where college students often serve as subjects for research intended to be generalized to industrial or other non-academic settings. The control and similarity of medical care facilities minimize this threat for health science professionals.
3. Interaction of history and treatment
involves differences in the ways that various treatment
groups may be affected by events occurring during the
course of the research.
For example, the treatment group might be enhanced by an outside event while the control group remains unaffected.
At this point you may be wondering how researchers
ever manage to design a completely valid study. The truth is
that no one ever does. There is no perfect study, only studies
containing degrees of imperfection. This is why studies must
be replicated in other settings, by other researchers, using
other operations and methods. Good researchers are
constantly aware of the pitfalls of study design presented by
threats to validity and try to avoid as many of them as
possible.
Reliability and Validity
Reliability
• How accurate or consistent is the measure?
• Would two people understand a question in the same way?
• Would the same person give the same answers under similar circumstances?
Validity
• Does the concept measure what it is intended to measure?
• Does the measure actually reflect the concept?
• Do the findings reflect the opinions, attitudes, and behaviors of the target population?
Example
Suppose that you have bathroom weight scales and these scales are broken. The scales represent the methodology. One person weighs you with these scales and obtains a result. Then the scales are passed along to another person, who follows the same procedure and weighs you with the same broken scales. The two people, using the same broken scales, arrive at similar measures. The results are reliable: they are obtained consistently by two (or perhaps more) people using the faulty scales. Although the results are reliable, they may not be valid. That is, because the scales are faulty, the results are not a true indicator of the real weight.
[Target-diagram figures: Not Valid and Not Reliable; Some Validity, Not Reliable; Not Valid but Reliable; Valid and Reliable]
The Reliability and Validity Relationship
• An instrument that is valid is always reliable
• An instrument that is not valid may or may
not be reliable
• An instrument that is reliable may or may
not be valid
Bias (systematic error)
- deviation from truth in one direction
- a "dirty" error
Random error (random variation, chance, noise)
- deviation from truth in both directions, i.e., above and below
- mean = the population's true value
- a "clean" error
- cannot be totally eliminated
- can be estimated and reduced by statistics
In most situations, both bias and chance are present:
Measured value = True value + Systematic error + Random error
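This decomposition can be illustrated with a short simulation; the bias and noise magnitudes below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

true_value = 70.0   # e.g., a person's true weight in kg
bias = 2.5          # systematic error: the scale reads consistently high
noise_sd = 0.8      # random error: varies from reading to reading

# Measured value = True value + Systematic error + Random error
readings = true_value + bias + rng.normal(0, noise_sd, size=1000)

print(f"Mean reading:   {readings.mean():.2f}  (true value {true_value}; the offset is the bias)")
print(f"SD of readings: {readings.std():.2f}  (reflects the random error only)")
```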
Sources of Measurement Error
1. Individual taking the measurements
(often called the tester or rater)
2. Measuring instrument
3. Variability of the characteristic being
measured
Sources of Measurement Error (cont.)
Many sources of error can be minimized
through careful planning, training, clear
operational definitions and inspection of
equipment.
Therefore, a testing protocol should
thoroughly describe the method of
measurement, which must be uniformly
performed across trials.
Isolating and defining each element of the
measure reduces the potential for error,
thereby improving reliability.
1) Statistical measures of validity:
- Dichotomous (2 x 2 table): sensitivity, specificity, accuracy

                 Gold Standard (Truth)
                   +ve      -ve
  Test     +ve      a        b
           -ve      c        d
  Total            a+c      b+d

Sensitivity = a / (a+c)
Specificity = d / (b+d)
Accuracy = (a+d) / (a+b+c+d)
- Continuous: one-sample (paired) t-test
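A sketch of these computations, using hypothetical table counts and hypothetical paired measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 2 table counts against the gold standard
a, b = 45, 10   # test +ve: true positives (a), false positives (b)
c, d = 5, 40    # test -ve: false negatives (c), true negatives (d)

sensitivity = a / (a + c)
specificity = d / (b + d)
accuracy = (a + d) / (a + b + c + d)
print(f"Sensitivity {sensitivity:.2f}, Specificity {specificity:.2f}, Accuracy {accuracy:.2f}")

# Continuous case: paired t-test of target test against criterion measure
target = np.array([70.2, 68.5, 72.1, 65.4, 69.8])
criterion = np.array([70.0, 68.9, 71.5, 65.0, 70.2])
t, p = stats.ttest_rel(target, criterion)
print(f"Paired t = {t:.2f}, p = {p:.3f}")
```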
2) Statistical measures of reliability:
Assumptions:
- Subjects are independent
- Observers are independent
- Categories of the (nominal, ordinal) scale are independent, mutually exclusive, and exhaustive
- Nominal data
Dichotomous → 2 x 2 table: Kappa
Polychotomous → r x r table: Overall Kappa, Individual Kappa
Degree of agreement:
K > 0.75: excellent agreement beyond chance
0.40 ≤ K ≤ 0.75: fair to good
K < 0.40: poor
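A minimal sketch of Cohen's kappa for an agreement table between two raters; the counts are hypothetical:

```python
import numpy as np

def cohens_kappa(table: np.ndarray) -> float:
    """Cohen's kappa for an r x r inter-rater agreement table."""
    n = table.sum()
    p_observed = np.trace(table) / n                          # observed agreement
    p_expected = (table.sum(0) * table.sum(1)).sum() / n**2   # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings: rows = rater 1, columns = rater 2
table = np.array([[40, 10],
                  [ 5, 45]])
print(f"Kappa = {cohens_kappa(table):.2f}")  # 0.70: fair to good on the scale above
```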
Kappa: SPSS output
Symmetric Measures
                                   Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement   Kappa       .245    .134                   2.498          .013
N of Valid Cases                   100
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
- Continuous data: intraclass correlation coefficient (ICC, rI)
ICC:
1. Accommodates ≥ 2 observers
2. Each subject can have a different number of observers
3. Can be applied to ordinal data (when intervals between categories are assumed to be equivalent)
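As a sketch, a one-way random-effects ICC, often written ICC(1,1), can be computed from the between- and within-subject mean squares. This version assumes every subject is rated by the same number of observers, and the ratings are hypothetical:

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1,1); rows = subjects, columns = observers."""
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    msb = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)           # between subjects
    msw = ((ratings - subject_means[:, None]) ** 2).sum() / (n * (k - 1))   # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical ratings: 5 subjects, each scored by 3 observers
ratings = np.array([[9, 10, 9],
                    [6,  7, 7],
                    [8,  8, 9],
                    [4,  5, 4],
                    [7,  6, 7]])
print(f"ICC(1,1) = {icc_oneway(ratings):.2f}")
```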