2. Test Validity
The extent to which inferences made on the basis of numerical scores are appropriate, meaningful, and useful
A situation-specific concept
Depends on the purpose, population, and characteristics of the environment in which the testing takes place
A test can be valid in one situation and not in another.
3. Educational Inferences
Assessing achievement – how well the content of a test represents a larger domain of content or tasks
Construct validity – attempts to measure traits or characteristics
Intelligence, creativity, ability, attitudes, reasoning, self-concept
Two types of rival hypotheses
Construct underrepresentation – the test fails to capture important aspects of the construct
Construct-irrelevant variance – the extent to which a measure includes material extraneous to the intended construct (e.g., measuring math reasoning with story problems, which also require reading)
4. Types of evidence of test validity
How well does it measure what it says it measures? – Evidence based on test content
Are the response strategies consistent with what is being measured? – e.g., testing thinking processes with multiple-choice questions
How are the items in the test related to each other? – Evidence based on the internal structure of the test: how well items measuring the same trait are related and correlated
How well does one instrument correlate with another, similar instrument?
How well does the test score predict performance on a criterion measure? – Test-criterion relationships (see the sketch below)
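A minimal sketch of test-criterion evidence, assuming hypothetical data: it correlates scores on an instrument with scores on a criterion measure (here, later course grades) using Pearson's r. The data and variable names are illustrative only, not part of the slides.

    import statistics

    def pearson_r(x, y):
        """Pearson correlation between two score lists (a validity
        coefficient when x is the test and y is the criterion)."""
        mx, my = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

    # Hypothetical data: aptitude-test scores and later course grades (criterion)
    test_scores = [52, 60, 65, 70, 74, 80, 85, 90]
    course_grades = [2.1, 2.4, 2.9, 2.8, 3.2, 3.4, 3.6, 3.9]
    print(round(pearson_r(test_scores, course_grades), 2))  # closer to 1.0 = stronger evidence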
5. Reliability
Refers to consistency of the measurement
The extent to which measures are free from error
Common sources of error
Changes in time limits or directions
Different scoring procedures
Interrupted testing session
Race of test administrator
Time the test is taken
Sampling of items
Ambiguity in wording
Effect of heat, light, ventilation
Differences in observers
Fatigue
Health
Motivation
Luck
Attention
Anxiety
6. Types of Reliability
Stability –
Stable characteristics over time
Same test given to the same individuals over time
Equivalence –
Comparability of two measures of the same trait given at about the same time
Different forms of the test given to the same individuals at about the same time
Equivalence and Stability –
Compare two measures of the same trait given to the same individuals over time
Internal consistency – administer one test and correlate the items with each other (Cronbach's alpha; see the sketch after this list)
Agreement – consistency of ratings or observations
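A minimal sketch of an internal-consistency estimate, assuming an invented item-by-respondent score matrix; it applies the standard Cronbach's alpha formula based on item variances and total-score variance.

    import numpy as np

    def cronbach_alpha(item_scores):
        """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
        items = np.asarray(item_scores, dtype=float)
        k = items.shape[1]                         # number of items
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical 5-item instrument answered by 6 respondents (1-5 ratings)
    scores = [[4, 5, 4, 4, 5],
              [2, 3, 2, 3, 2],
              [5, 5, 4, 5, 5],
              [3, 3, 3, 2, 3],
              [4, 4, 5, 4, 4],
              [1, 2, 1, 2, 1]]
    print(round(cronbach_alpha(scores), 2))  # about 0.97 for this made-up data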
7. Interpretation of Reliability Coefficients
Acceptable range of reliability coefficients for most instruments is between .70 and .90
Higher reliability with
A group that is heterogeneous on the trait measured
More items on the instrument
A greater range of scores
Achievement tests of medium difficulty
When based on a norming group – reliable only for a similar group
The more the items discriminate between high and low achievers, the higher the reliability
8. Norm- and Criterion-Referenced
Norm-referenced scores show how individual scores compare with those of a well-defined reference or norm group of individuals (see the percentile-rank sketch below)
Depends entirely on how the subjects compare with one another
The best distribution of scores shows a high variance
Items must discriminate between individuals
Test items must be fairly difficult
If everyone gets a high score, there is NO differentiation between individuals
Important content or skills may not be measured
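A minimal sketch of a norm-referenced interpretation, assuming a made-up norm group and approximately normal scores: it expresses an individual raw score as a z-score and a percentile rank within the norm group.

    import statistics
    from math import erf, sqrt

    def percentile_rank(raw_score, norm_group_scores):
        """Express a raw score as a z-score and normal-curve percentile
        relative to a norm group (assumes roughly normal scores)."""
        mean = statistics.mean(norm_group_scores)
        sd = statistics.stdev(norm_group_scores)
        z = (raw_score - mean) / sd
        pct = 0.5 * (1 + erf(z / sqrt(2))) * 100
        return z, pct

    # Hypothetical norm group of 10 scores; an examinee who scored 82
    norm = [55, 61, 64, 68, 70, 72, 75, 78, 81, 86]
    z, pct = percentile_rank(82, norm)
    print(f"z = {z:.2f}, percentile rank = {pct:.0f}")  # roughly the 88th percentile here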
9. Criterion-Referenced or Standards-Based Interpretation
Normed tests
Attend carefully to characteristics of the norm or reference group
Ceiling effect – testing a gifted group yields little variability
Criterion-referenced tests
Score is interpreted by comparing it with professionally judged standards
Standards of proficiency – what subjects are able to do
Result in a highly skewed distribution
Lessens variability
Judges mastery of the domain tested
10. Standardized Tests
Group Intelligence or Ability – Cognitive Abilities Test
Individual Intelligence – Stanford-Binet/Wechsler
Multifactor – Differential Aptitude Test
Special – Torrance Test of Creative Thinking
Diagnostic – Woodcock Reading Mastery
Criterion-referenced – Writing Skills Test
Specific subjects – Modern Math Understanding Test
Batteries – Stanford Achievement Test Series
11. Personality, Attitude, Value, and Interest Inventories
Personality –
Rorschach Inkblot Test
Coopersmith Self-Esteem Inventory
Attitude – Minnesota School Affect Assessment
Value – Work Values Inventory
Interest – Kuder Occupational Interest Inventory
12. Designing a Questionnaire
Justification – generally use a proven instrument
Defining Objectives – list the specific information you hope to get
Writing Questions and Statements
Make items clear
Avoid double-barrelled questions (avoid "and")
Respondents must be competent to answer (and provide reliable information)
Questions should be relevant
Short, simple items are best
Avoid negative items
Avoid biased items or terms
13. Types of Items
Closed form – structured response where the subject chooses among predetermined responses
Open form – the subject writes in any response
Scaled items
Gradations, levels, or values
Likert scale (see the coding sketch after this list)
Semantic differential scale
Ranked items
Checklist items
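A minimal sketch of how scaled-item responses can be coded, assuming a common five-point Likert convention: response labels map to numeric codes, and negatively worded items are reverse-scored so that higher always means more favorable.

    # Five-point Likert coding (a common convention, assumed here)
    LIKERT = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
              "Agree": 4, "Strongly Agree": 5}

    def score_item(response, reverse=False):
        """Convert a response label to a numeric code; reverse-score
        negatively worded items."""
        code = LIKERT[response]
        return 6 - code if reverse else code

    print(score_item("Agree"))                # 4
    print(score_item("Agree", reverse=True))  # 2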
14. Pros and Cons of Data Collection Techniques
Paper/pencil tests
Pros: economical, standardized
Cons: norms may be inappropriate; respondents must be able to read
Alternative assessment
Pros: holistic, authentic
Cons: subjective rating; costly and time-consuming
Questionnaire
Pros: economical, easy to score; can be anonymous
Cons: response rate; inability to probe and clarify; biased or ambiguous items
Interview
Pros: flexible; can include nonverbal responses
Cons: costly and time-consuming; effect of interviewer and interviewer bias
Observation
Pros: captures natural behavior
Cons: costly and time-consuming; observer bias; not anonymous
15. Design a Questionnaire
Example scale descriptors:
Strongly Agree | Agree | Neutral | Disagree | Strongly Disagree
Always | Most of the Time | Sometimes | Rarely | Never
Very happy | Somewhat happy | Neither sad nor happy | Somewhat sad | Very sad
Use a Likert scale (or a combination of Likert, ranking, and semantic differential items)
Read the types of descriptors on page 262.
Example semantic differential anchors:
Like – Dislike
Important – Unimportant