3. Outline of Chapters 8 and 9
• Main points in Chapters 8 and 9, with immediate examples
• Expansion, themes and explanation
• Questions & related videos
3/23/2020 3
4. VALIDITY & RELIABILITY
• Validity refers to the appropriateness, meaningfulness,
correctness, and usefulness of the inferences which a
researcher makes.
• For example, a test of intelligence should measure
intelligence and not something else (such as memory)
• Reliability refers to the consistency of scores or
answers from one administration of an instrument to
another.
• For example, measurements of people's height and
weight are often extremely reliable.
5.
Validity: determines whether the
research truly measures that
which it was intended to or how
truthful the research results are.
Reliability and validity are
conceptualized as trustworthiness,
rigor and quality in the qualitative
paradigm. These can be achieved
by eliminating bias and increasing
the truthfulness of a proposition
about some social phenomenon
through triangulation.
7. It is the most important idea
to consider when preparing
or selecting an instrument.
Validation is the process of
collecting and analyzing
evidence to support such
inferences.
The validity of a
measurement tool is the
degree to which the tool
measures what it claims to
measure.
8. The Importance of Valid Instruments
• The quality of instruments used in
research is very important because
conclusions drawn are based on
the information obtained by these
instruments.
• Researchers follow certain
procedures to make sure that the
inferences they draw based on the
data collected are valid and
reliable.
• Researchers should keep in mind
these two terms, validity and
reliability, in preparing data
collection instruments.
9. How can validity be established?
• Quantitative studies:
– measurements, scores, instruments
used, research design.
An example: a survey designed to measure how much
time a doctor takes to tend to a patient
once the patient walks into the hospital.
• Qualitative studies:
– ways that researchers have devised to
establish credibility: member checking,
triangulation, thick description, peer
reviews, external audits.
– Examples: diary accounts, open-ended
questionnaires, documents, participant
observation and ethnography.
10. Evidence of Validity
There are 3 types of evidence a researcher might
collect:
Content-related evidence of validity
Content and format of the instrument
An example: Is the test fully representative of
what it aims to measure?
Criterion-related evidence of validity
Relationship between scores obtained using
the instrument and scores obtained
An example: Do the results correspond to a
different test of the same thing?
Construct-related evidence of validity
Psychological construct being measured by
the instrument
An example: Does the test measure the
concept that it’s intended to measure?
11. A key element is the adequacy of the
sampling of the domain the instrument is
supposed to represent.
The other aspect of content validation is the
format of the instrument.
Attempts to obtain evidence that the items
measure what they are supposed to measure
typify the process of content-related evidence.
Content-related Evidence
12. • Content-related evidence of validity:
Content and format of the instrument
• Examples:
• How appropriate is the content?
• How comprehensive is the
content?
• Does the content get at the
intended variable?
• How adequately does the
sample of items or questions
represent the content to be
assessed?
• Is the format of the instrument
appropriate?
13. • Content-related evidence, for example:
• The effects of a new listening program on speaking
ability of fifth-graders:
• Adequacy of sampling: whether the items
adequately represent the content domain
being measured.
• Format of the instrument: The clarity of printing,
size of type, adequacy of work space,
appropriateness of language, clarity of
directions etc.
14. – How can we obtain content-related evidence of validity?
• Have someone who knows enough about what is being
measured to be a competent judge. In other words, the
researcher should get more than one judge’s opinions
about the content and format of the instrument to be
applied before administering it.
• The researcher evaluates the feedback from the judges
and makes necessary modifications in the instrument.
15. How to establish Content
Validity?
1) Instructional objectives.
An example:
• At the end of the chapter, the student will be able
to do the following:
1. Explain what ‘stars’ are
2. Discuss the type of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our stars, the sun, and all other
stars
16. 2) Table of Specification.
(An example):
17. A criterion is a second test presumed
to measure the same variable.
Criterion-related evidence of validity: The
relationship between the scores obtained
by an instrument and the scores obtained
by another instrument (a criterion).
• Examples:
How strong is this relationship?
How well do such scores estimate
the present or future performance of
a certain type?
Criterion-related Evidence
18. • When a correlation coefficient is used to describe the relationship
between a set of scores obtained by the same group of
individuals on a particular instrument (the predictor) and their
scores on some criterion measure (the criterion), it is called a
validity coefficient.
• Expectancy table: (See Table 8.1, p. 164)
– Criterion-related evidence:
• Compare performance on one instrument with performance
on some other, independent criterion. Academic ability
scores of the students on the instrument compared with
students’ grade point averages. High scores on the
instrument will correspond to high grade point averages.
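The validity coefficient described above is simply a Pearson correlation between predictor and criterion. A minimal sketch in Python, using invented ability scores and grade point averages:

```python
import numpy as np

# Hypothetical data (invented for illustration): academic-ability scores
# (the predictor) and grade point averages (the criterion) for ten students.
ability = np.array([52, 47, 60, 38, 55, 44, 58, 41, 50, 63])
gpa = np.array([3.1, 2.8, 3.6, 2.2, 3.3, 2.6, 3.5, 2.4, 3.0, 3.8])

# The validity coefficient is the Pearson correlation between the
# predictor scores and the criterion scores.
validity_coefficient = np.corrcoef(ability, gpa)[0, 1]
```

A coefficient near 1.0 would indicate that high instrument scores correspond closely to high grade point averages, as the slide describes.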
19. – Two forms of criterion-related validity:
1) Predictive validity: Student scores on a science
aptitude test administered at the beginning of the
semester are compared with the end-of-the-semester
grades.
• An example: the validity of a cognitive test for job
performance is the correlation between test scores
and, for example, supervisor performance ratings.
2) Concurrent validity: Instrument data and criterion data
are collected at nearly the same time and the results are
compared to obtain evidence of concurrent validity.
• An example: Researchers give a group of students a
new test, designed to measure mathematical aptitude.
They then compare this with the test scores already
held by the school, a recognized and reliable judge of
mathematical ability.
20. Considered the broadest of the three categories.
There is no single piece of evidence that satisfies
construct-related validity.
Researchers attempt to collect a variety of types
of evidence, including both content-related and
criterion-related evidence.
The more evidence researchers have from
different sources, the more confident they become
about the interpretation of the instrument.
Construct-related Evidence
21. The nature of the psychological construct or characteristic being
measured by the instrument.
Examples:
• How well does a measure of the construct explain the
differences in the behavior of individuals or their performance
on certain tasks?
• A women’s studies program may design a cumulative
assessment of learning throughout the major. If the
questions are written with complicated wording and
phrasing, the test can inadvertently become a test of
reading comprehension rather than a test of women’s
studies. It is important that the measure actually
assesses the intended construct rather than an
extraneous factor.
22. In obtaining construct-related evidence
of validity, three steps are involved.
1. The variable being measured is clearly defined.
2. Hypotheses, based on a theory underlying the
variable, are formed about how people who
possess a lot versus a little of the variable will
behave in a particular situation.
3. The hypotheses are tested both logically and
empirically.
23. Construct-related Evidence
– Does the test measure the ‘human’ characteristic(s)
it is supposed to measure?
– Verbal reasoning
– Mathematical reasoning
– Musical ability
– Spatial ability
– Mechanical aptitude
– Motivation
*Applicable to authentic assessment, each construct is broken
down into its component parts
An example: ‘motivation’ can be broken down to:
– Interest
– Attention span
– Hours spent
– Assignments undertaken and submitted, etc.
All of these sub-constructs put together – measure ‘motivation’
24. Factors that can lower Validity
• Unclear directions
• Difficult reading vocabulary and sentence structure
• Ambiguity in statements
• Inadequate time limits
• Inappropriate level of difficulty
• Poorly constructed test items
• Test items inappropriate for the outcomes being
measured
• Tests that are too short
• Improper arrangement of items (complex to easy?)
• Identifiable patterns of answers
• Nature of criterion
26. RELIABILITY
• Reliability refers to the consistency
of scores obtained from one
administration of an instrument to
another and from one set of items to
another.
• If the test is reliable, we would expect
a student who receives a high score
on a typing-ability test the first time
to receive a high score the next time
he takes it. The scores may not be
identical, but they should be close.
27. How can reliability be established?
• Quantitative studies?
– Assumption of repeatability
• Qualitative studies?
– Reframe as dependability and
confirmability
28. ERRORS OF MEASUREMENT
• Whenever people take the same test twice,
they will seldom perform exactly the same,
that is, their scores or answers will not be
identical. It is inevitable due to a variety of
factors such as motivation, energy, anxiety,
a different testing situation etc. Such factors
result in errors of measurement.
29. The scores obtained
from an instrument can
be quite reliable, but
not valid. For example, a
test on the Constitution of
the US may yield consistent
scores yet tell us nothing
about success in physical
education. If the data are
unreliable, they cannot
lead to valid inferences.
(See Figure 8.2, p.166)
30. Validity and Reliability
(Target-diagram panels: neither valid nor reliable; reliable
but not valid; valid and reliable; fairly valid but not very
reliable.)
Think in terms of ‘the purpose of tests’ and the
‘consistency’ with which the purpose is fulfilled/met.
32. Reliability Coefficient
• A reliability coefficient can be estimated in three
well-known ways:
• Test–retest
Give the same test twice to the same group, with a time
interval between the two administrations.
• Equivalent forms (similar in content, difficulty level,
arrangement, type of assessment, etc.)
Give two forms of the test to the same group in close
succession.
• Internal-consistency methods (and, for subjective
scoring, percent of exact agreement between scorers)
Correlate scores using Pearson's product-moment
correlation and examine the coefficient of
determination (e.g., in SPSS).
33. • The test-retest method involves administering the
same test to the same group after a certain time has
elapsed. A reliability coefficient is then calculated to
indicate the relationship between the two sets of
scores obtained. For most educational research,
stability of scores over a two- to three-month period is
usually viewed as sufficient evidence of test-retest
reliability.
• An example: test on a Monday, then again the
following Monday.
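The test-retest (stability) coefficient is likewise just a correlation between the two administrations. A sketch in plain Python with invented typing-test scores for the two Mondays:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical typing-test scores (invented) for eight students tested
# on a Monday and again the following Monday.
monday = [62, 55, 70, 48, 66, 59, 73, 51]
next_monday = [60, 57, 72, 45, 64, 61, 75, 50]

# The test-retest (stability) coefficient is the correlation between
# the two administrations.
stability = pearson_r(monday, next_monday)
```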
34. Equivalent-Forms Method
• When the equivalent-forms method is used, two
different but equivalent (parallel or alternate) forms of
an instrument are administered to the same group of
individuals during the same time period. A high
coefficient would indicate strong evidence of reliability
– that two forms are measuring the same thing.
• An example: uses one set of questions divided into
two equivalent sets (“forms”), where both sets
contain questions that measure the same construct,
knowledge or skill
35. INTERNAL CONSISTENCY METHODS
• The methods we have seen so far require two
administrations or testing sessions. There are several
internal-consistency methods of estimating reliability that
require only a single administration of an instrument.
» Split-half procedure: This procedure involves scoring two halves
(odd items versus even items) of a test separately for each person
and then calculating a correlation coefficient for the two sets of
scores. The coefficient indicates the degree to which two halves
of the test provide the same results and hence describes the
internal consistency of the test.
An example: If we are interested in the perceived practicality of
electric cars and gasoline-powered cars, we could use a split-
half method and ask the "same" question two different ways.
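The split-half procedure can be sketched in a few lines; the item matrix below is invented, and the half-test correlation is conventionally stepped up with the Spearman-Brown formula (a standard companion to the split-half method, not named on the slide) to estimate full-length reliability:

```python
import numpy as np

# Hypothetical item matrix: rows = 6 examinees, columns = 8 dichotomous
# items (1 = correct, 0 = wrong). All values are invented.
items = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
])

# Score the two halves (odd items versus even items) for each person.
odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

# Correlation between the two sets of half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown step-up: estimated reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
```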
36. INTERNAL CONSISTENCY METHODS
Kuder-Richardson Approaches: The most frequently used method for
determining internal consistency is the Kuder-Richardson approach,
particularly formulas KR20 and KR21.
An example: reliability for a binary test (i.e. one with right or wrong
answers).
KR21 requires (1) the number of items on the test, (2) the mean, and (3)
the standard deviation if the items on the test are of equal difficulty. (See
KR21 formula in P.167)
Alpha Coefficient – Another check on the internal consistency of an
instrument is to calculate an alpha coefficient, frequently called
Cronbach’s alpha and symbolized as α.
An example: it measures internal reliability for tests with multiple
possible answers.
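Both KR21 and Cronbach's alpha are short formulas. A sketch with invented numbers (the KR21 function follows the number-of-items/mean/SD version referenced above):

```python
import numpy as np

def kr21(k, mean, sd):
    """KR21 reliability estimate from the number of items (k), the mean
    score, and the standard deviation; assumes items of roughly equal
    difficulty."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))

def cronbach_alpha(scores):
    """Cronbach's alpha from an examinee-by-item score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# KR21 for a hypothetical 50-item test with mean 40 and SD 6: about .79.
r_kr21 = kr21(50, 40, 6)

# Alpha for a tiny invented matrix whose two items rank examinees
# identically (alpha = 1.0 for perfectly consistent items).
scores = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)
alpha = cronbach_alpha(scores)
```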
37. INTERNAL CONSISTENCY
METHODS
• The standard error of measurement
(SEM) is a measure of how much
measured test scores are spread
around a “true” score. It is an index
showing the extent to which a
measurement would vary under
changed circumstances (the amount of
measurement error).
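The SEM follows directly from the score standard deviation and the reliability coefficient via SEM = SD × √(1 − r), the standard formula; a sketch with invented values:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

# Hypothetical values: an SD of 10 and a reliability of .91 give an SEM
# of 3.0, i.e., observed scores typically fall within about 3 points of
# the "true" score.
sem = standard_error_of_measurement(10, 0.91)
```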
38. Validity & Reliability Coefficients
• A validity coefficient expresses the
relationship between scores of the same
individuals on two different
instruments.
• A reliability coefficient expresses the
relationship between the scores of the
same individuals on the same
instrument at two different times or
between two parts of the same
instrument.
• Reliability coefficients range from
.00 to 1.00, with no negative values.
39. Conclusion
• Reliability and validity are concepts used to
evaluate the quality of research. They indicate
how well a method or technique measures
something. Reliability is about the consistency of
a measure, and validity is about the accuracy of a
measure.
• It’s important to consider reliability and validity
when you are creating your research design,
planning your methods, and writing up your
results, especially in quantitative research.
41.
M.D.: “You think that the increase in your blood pressure
is due to the new class you’ve been assigned. Is anything
else different?”
Teacher: “Well, I have been drinking more lately, and I’m
on a new diet. Also my wife and I haven’t been getting
along of late.”
42. INTERNAL VALIDITY
• When a study has internal validity, it means that any
relationship observed between two or more variables
should be unambiguous as to what it means rather than
being due to “something else” – (alternative hypothesis).
• Defending against sources of bias arising in research
design.
• An example: let’s suppose you ran an experiment to see if
mice lost weight when they exercised on a wheel. You
used good experimental practices, like random samples,
and you used control variables to account for other things
that might cause weight loss (change in diet, disease, age
etc.). In other words, you accounted for the confounding
variables that might affect your data and your experiment
has high validity.
43. Threats to Internal Validity
• Subject characteristics
• Mortality
• Location
• Instrumentation
• Testing
• History
• Maturation
• Attitude of Subjects
• Regression
• Implementation
44. Subject characteristics
Participants in a study may have different characteristics and
those differences may affect the results.
Examples: age, intelligence, vocabulary, strength, attitude,
fluency, maturity, ethnicity, coordination, speed,
socioeconomic status.
45. Mortality
Loss of subjects from the study, for example due to:
– Illness
– Family relocation
– Requirements of other activities
– Absent during collection of data or fail to
complete tests
46. Location
Place in which data is collected, or
an intervention is carried out, may
influence the results.
An example: Studying the behavior of
animals in a zoo may make it easier
to draw valid causal inferences within
that context, but these inferences
may not generalize to the behavior of
animals in the wild.
47. Instrumentation
Inconsistent use of the measurement instrument.
1. Instrument Decay – instrument changes in some way
2. Data Collector Characteristics – characteristics of the
data collector, such as gender, age, ethnicity, or language
patterns, can affect the nature of the data obtained.
3. Data Collector Bias – unconsciously influence the
outcome of the data.
An example: Two examiners for an instructional experiment
administered the post-test with different instructions and
procedures.
48. Testing
• In an experiment in which performance on a logical
reasoning test is the dependent variable, a pre-test cues the
subjects about the post-test.
• So pre-tests may influence the result of the post-test.
• A group may perform better on a post-test because the pre-
test primed them to perform better.
• An example: Participants may remember the correct
answers or may be conditioned to know that they are being
tested. Repeatedly taking (the same or similar) intelligence
tests usually leads to score gains.
49. History
Occurrence of events that could alter the outcome or
the results of the study
Previous – events that occurred prior to the study
Concurrent – happening during the study.
An example: What if the children in one group differ
from those in the other in their television habits?
Perhaps the experimental-group children watch a
certain program more frequently than those in the
control group do. There is then a real effect of the two
groups differentially experiencing a relevant event
(in this case, a certain program) between the
pretest and posttest.
50. Maturation
Any changes that occur in the subjects during the
course of the study that are not part of the study
and that might affect the results of the study; such
as changes due to aging and experience.
An example: Participants’ moods can shift from good
to bad over the course of the study. Factors such as
tiredness, boredom, hunger and inattention can also
occur. These factors can be driven by the research
participant or by the experiment.
51. Attitude of Subjects
• Subjects’ opinions and participation can influence the
outcome.
• Observing or studying subjects can affect their
responses a.k.a. Hawthorne effect.
• Subjects receiving experimental treatment may
perform better due to “receiving” treatment.
• Subjects in the control group may perform more
poorly than the treatment group.
• An example: The students may feel that they are
being tested too often. This feeling could cause them
to get tired of having to take another test and not do
their best. This change in attitude could cause the
results of the study to be biased.
52. Regression
Groups that are chosen due to performance
characteristics, either high or low, will on the
average score closer to the mean on subsequent
testing regardless of what transpires during the
experiment.
An example: A group of students who score low on
a mathematics test are given additional help. Six
weeks later they are given another exam with similar
problems, and their average score has improved. Is
this due to the additional help or to other influences?
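Regression toward the mean can be demonstrated with a small simulation (all values invented): students selected for low first-test scores improve on retest even when nothing at all happens between tests.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10,000 students: a fixed "true" ability plus independent
# measurement error on each testing occasion.
true_score = rng.normal(50, 10, size=10_000)
test1 = true_score + rng.normal(0, 8, size=10_000)
test2 = true_score + rng.normal(0, 8, size=10_000)

# Select the students who scored in the bottom 20% on the first test.
low_group = test1 < np.percentile(test1, 20)

# Their second-test mean moves back toward the population mean of 50
# even though no extra help was given between tests.
mean_first = test1[low_group].mean()
mean_second = test2[low_group].mean()
```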
53. Implementation
Personal bias in favor of one method or another. A preference
for one method may account for better performance by the
subjects; implementation also concerns how well you carry out
your methodology and whether it measures what your research
question intended.
An example: If you were interested in the impact on exam
performance of two teaching methods (students receiving lectures
plus seminar classes versus students receiving lectures only),
you would also want to ensure that the teachers involved in the
study had a similar educational background, teaching experience,
and so forth.
54. Methods to Minimize Threats
Standardization of the conditions under which the
research study is carried out will help minimize
threats from history and instrumentation.
Obtain as much information as possible about the
participants in the research study to minimize
threats from mortality and selection.
Obtain as much information as possible about the
procedural details of the research study, for
example, where and when the study occurs,
minimizing threats from history and
instrumentation.
Choose an appropriate research design which can
help control most other threats.
55. An Example: Is the test appropriate to the
population?
• What is the composition of the test-taking population?
• To what extent can the assessment be administered without
encumbrance to all members of the population?
• Is there a translated version, adapted version or accommodated
version of the test?
• Are there recommendations for alternative testing procedures?
• Has the planned accommodation been assessed in terms of its
impact on the validity and reliability of test scores?
56. Conclusion
• Internal validity refers to how well a piece of research allows
you to choose among alternate explanations of something. A
research study with high internal validity lets you choose one
explanation over another with a lot of confidence, because it
avoids (many possible) confounds.
• The more valid and reliable the research instruments are, the
more likely one is to draw the appropriate conclusions from the
collected data and solve the research problem in a credible
fashion.
57. How do I go about
establishing /
ensuring validity
& reliability in my
own test papers?
58. What do you think…?
• Forced-choice assessment forms are high in reliability, but
weak in validity (true/false)
• Performance-based assessment forms are high in both
validity and reliability (true/false)
• A test item is said to be unreliable when most students
answered the item wrongly (true/false)
• When a test contains items that do not represent the
content covered during instruction, it is known as an
unreliable test (true/false)
• Test items that do not successfully measure the intended
learning outcomes (objectives) are invalid items
(true/false)
• Assessment that does not represent student learning well
enough are definitely invalid and unreliable (true/false)
• A valid test can sometimes be unreliable (true/false)
– If a test is valid, it is reliable! (by-product)
62. Bibliography
Fraenkel, J. R., & Wallen, N. E. (2002). Validity & reliability;
internal validity. In How to design and evaluate research in
education (5th ed.).
Martin, Wendy (1997). Single group threats to internal validity.
Retrieved October 15, 2006, from
http://www.socialresearchmethods.net/tutorial/Martin/intval1.html
https://www.youtube.com/watch?v=KuT2n1w0Ixc
https://www.youtube.com/watch?v=F6LGa8jsdjo
https://www.youtube.com/watch?v=2fK1ClycBTM