Advantages and Disadvantages of the Existence of Two Test Review Systems in the Same Country
Anders Sjöberg (Stockholm University, Sweden) anders.sjoberg@psychology.su.se
State of the art
The background and development of the two test review systems in use in Sweden today are described: the National Board of Health and Welfare (NBHW) test review system and the European Federation of Psychologists' Associations (EFPA) test review system. Advantages and disadvantages are discussed from the perspective that validity is (or is not) a characteristic of a test.
New perspectives/Contributions
The development of convergent standards that run in parallel is discussed. Different outlooks on psychometric characteristics are outlined.
Practical implications
Effects on practice generated by the existence of multiple review systems are discussed, as well as the import of an international review system that is to coexist with a nationally developed standard.
2. Outline
• The background and development of two test review
systems
• Advantages and disadvantages of each system are
presented from the perspective that validity is (or is
not) a characteristic of a test.
• Example of a selection process validity study
• Questions
3. Swedish Psychological
Association
• The Swedish Psychological Association (SPA) is the
union and professional organization for the country’s
psychologists
• One of its tasks is to review different kinds of
psychological assessments carried out in the work and
organizational area, such as personality and
cognitive ability tests used for selection and
development
4. Swedish National Board of
Health and Welfare (NBHW)
• The National Board of Health and Welfare is a
government agency in Sweden under the Ministry of
Health and Social Affairs
• One of its tasks is to review different types of
psychological assessments, such as those used to
detect violence in marriage, abuse of alcohol and
other health-related problems
5. SPA Review Model
Procedure
• SPA Review Model for the Description and
Evaluation of Psychological Tests is a procedure
that employs two anonymous reviewers for each
test review, with a third person to oversee the review
(Consulting Editor)
7. EFPA Review Model Sources
• British Psychological Society (BPS) Test Review
Evaluation Form
• The Spanish Questionnaire for the Evaluation of
Psychometric Tests (Spanish Psychological
Association)
• The Rating System for Test Quality produced by the
Committee on Testing of the Dutch Association of
Psychologists
• American Psychological Association [APA],
American Educational Research Association
[AERA], and National Council on Measurement in
Education [NCME]: Standards for Educational and
Psychological Tests
8. EFPA Validity
• The framework to operationalize validity is based on
Standards for Educational and Psychological Tests
([APA], [AERA], [NCME], 1954). This
conceptualization of validity holds that there are
three approaches to the validation of tests.
• Content validation (demonstration that test items are
a representative sample of the behaviors)
• Criterion-related validation (demonstration that
scores on a test are related to an outcome)
• Construct validation (collection of evidence that a
psychological concept or construct explains test
performance)
9. Practice
• EFPA Review Model for the Description and
Evaluation of Psychological Tests, Version 3.42
(2008)
10. Item 2.10.1
Construct Validity - Overall Adequacy
(This overall rating is obtained by using judgment based on the ratings given for items 2.10.1.2 –
2.10.1.6. Do not simply average numbers to obtain an overall rating.)
2.10.1.2
Sample sizes:
[ -2] No information given.
[ -1] One inadequate study (e.g. sample size less than 100).
[ 0 ] One adequate study (e.g. sample size of 100-200).
[ 1 ] More than one adequate or large sized study.
[ 2 ] Good range of adequate to large studies.
2.10.1.4
Median and range of the correlations between the test and other
similar tests:
[ -2] No information given.
[ -1] Inadequate (r < 0.55).
[ 0 ] Adequate (0.55 < r < 0.65).
[ 1 ] Good (0.65 < r < 0.75).
[ 2 ] Excellent (r > 0.75)
2.10.1.5
Quality of instruments as criteria or markers:
[ -2] No information given.
[ -1] Inadequate information given.
[ 0 ] Adequate quality
[ 1 ] Good quality.
[ 2 ] Excellent quality with wide range of relevant markers for convergent and divergent validation.
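The correlation bands of item 2.10.1.4 above can be sketched as a small lookup. This is only an illustration, not part of the EFPA model: the function name is hypothetical, and because the slide uses strict inequalities, the handling of the exact boundary values (0.55, 0.65, 0.75) is an assumption made here. The item description itself says the overall rating is a judgment, so a lookup like this covers one sub-item only.

```python
from typing import Optional


def rate_convergent_correlation(median_r: Optional[float]) -> int:
    """Map the median correlation with similar tests to the -2..2 scale
    of item 2.10.1.4 (hypothetical helper, for illustration only)."""
    if median_r is None:
        return -2  # no information given
    if median_r < 0.55:
        return -1  # inadequate
    if median_r < 0.65:
        return 0   # adequate (assumption: exactly 0.55 counts as adequate)
    if median_r < 0.75:
        return 1   # good (assumption: exactly 0.65 counts as good)
    return 2       # excellent (assumption: exactly 0.75 counts as excellent)


if __name__ == "__main__":
    for r in (None, 0.40, 0.60, 0.70, 0.80):
        print(r, rate_convergent_correlation(r))
```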
11. Swedish National Board of
Health and Welfare (NBHW)
• American Psychological Association [APA],
American Educational Research Association
[AERA], and National Council on Measurement in
Education [NCME]: Standards for Educational and
Psychological Tests
• EFPA Review Model for the Description and
Evaluation of Psychological Tests
• Buros Center for Testing
12. NBHW Procedure
• The NBHW Test Review Model for the Description and
Evaluation of Assessment has a procedure that
employs two anonymous reviewers for each
assessment review, with one person to oversee the
review (Consulting Editor)
13. NBHW Validity
• Validity is defined as the degree to which evidence and
theory support the interpretation of assessment scores
proposed by the service provider of the assessment.
• Instead of talking about different kinds of validity, the
service provider of the assessment must state explicitly
what interpretations are to be derived from a set of scores
and how to use these scores for decision making.
• In this way, the strength of the validity evidence refers to
the probability that the inference is correct.
• Thus, it is critical for service providers who design
and conduct validation studies to concentrate
their efforts on ensuring evidence for the inferences they
wish to make, in much the same way that they would
otherwise “defend” their conclusions in a hypothesis-
testing situation.
14. Practice
• NBHW test Review Model for the Description and
Evaluation of Assessment
15. Validity
The process of validation involves accumulating evidence to provide a sound scientific basis for the
proposed score interpretations. It is the interpretations of assessment scores required by proposed
uses that are evaluated, not the assessment itself. When test scores are used or interpreted in more
than one way, each intended interpretation must be validated.
Evidence that the interpretations of the assessment scores are correct.
Describe the validity studies
X Evidence that the interpretations of the results are correct cannot be evaluated because information is
missing or insufficient
X Evidence that the interpretations of the results are correct should be revised and clarified
X Evidence that the interpretations of the results are correct should be supplemented
X Evidence that the interpretations of the results are correct is good
Justification of rating:
Proposals
16. As a reviewer
Validity of a test
• Easy to evaluate
• Concentrates on statistics
Validity of the use of a test score
• Difficult to evaluate
• Concentrates on content and evidence
17. As a client
Validity of a test
• Difficult to evaluate
• Concentrates on statistics
Validity of the use of a test score
• Easy to evaluate
• Concentrates on content and evidence
18. Selection practice
• SPA model - psychometric properties of the test
• NBHW model - the selection process and decision
19. Example Selection
• Organization A uses an intelligence test in the selection process
(N = 200)
• Organization B uses an intelligence test in the selection process
(N = 200)
21. Results based on the validity
argument
                      Test score
                      Low     High
Performance   Low     85      15
              High    15      85
r = .70
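The reported r = .70 can be checked against the four cell counts on this slide. A minimal sketch, assuming the two 85s are the concordant cells (low score with low performance, high score with high performance) and the two 15s are the discordant cells. The slide does not label the cells, so this arrangement is an assumption, as is the function name. For a 2×2 table the Pearson correlation between the two dichotomies is the phi coefficient.

```python
import math


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient for a 2x2 table laid out as
           col0  col1
    row0     a     b
    row1     c     d
    """
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


if __name__ == "__main__":
    # Assumed layout: rows = performance (low, high), columns = test score (low, high).
    phi = phi_coefficient(85, 15, 15, 85)
    print(f"N = {85 + 15 + 15 + 85}, phi = {phi:.2f}")
```

Under this reading the cells sum to the stated N = 200 and give (85·85 − 15·15) / 10000 = .70, matching the slide.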
22. Question and Analysis
• The relationship between the test score and the
selection decision (Not selected or Selected)
• Is the selection decision based on the intelligence score?
25. Conclusions
• Psychometric quality is important but not sufficient
to ensure good test use
• Both psychometric quality and practical use of the
test score should be included as criteria in the
review models
• Start to discuss the validity definition in your
test review models
28. EFPA Version 3.3: November 2004
• When judging overall validity, it is important to bear in
mind the importance placed on construct validity as
the best indicator of whether a test measures what it
claims to measure. In some cases, the main evidence
of this could be in the form of criterion-related studies.
Such a test might have an ‘adequate’ or better rating
for criterion-related validity and a less than adequate
one for construct validity. In general, if the evidence of
criterion-related validity or the evidence for construct
validity is at least adequate, then, by implication, the
overall rating must also be at least adequate. It
should not be regarded as an average or as the
lowest common denominator.
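The rule in this passage sets a floor rather than a formula: if either criterion-related or construct validity is at least adequate, the overall rating must be at least adequate, and it is neither the average nor the minimum. A minimal sketch of that consistency check, assuming the -2..2 scale with 0 meaning "adequate" that appears in the Version 3.42 items; whether Version 3.3 used the same numeric scale is not stated here, and the function name is hypothetical.

```python
ADEQUATE = 0  # assumption: 0 = "adequate" on a -2..2 scale


def overall_rating_is_consistent(overall: int, construct: int, criterion: int) -> bool:
    """Check an overall validity rating against the floor rule.

    If the best of the two component ratings is at least adequate,
    the overall rating must be at least adequate too. Otherwise the
    rule imposes no constraint and the rating is left to judgment.
    """
    if max(construct, criterion) >= ADEQUATE:
        return overall >= ADEQUATE
    return True


if __name__ == "__main__":
    # Criterion-related evidence is good (1), construct evidence is inadequate (-1).
    print(overall_rating_is_consistent(0, construct=-1, criterion=1))   # floor respected
    print(overall_rating_is_consistent(-1, construct=-1, criterion=1))  # taking the minimum violates the rule
```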
Editor's Notes
Hunter and Schmidt (1984) summarized the results of 425 validity studies that examined the relationship between a cognitive ability test and job performance. A total of 31,124 participants were included in the studies. Ability was measured with the General Aptitude Test Battery. Job performance was measured by having the immediate supervisor rate the employee's job performance. The occupations included in the studies were divided into five categories according to their complexity: 1 indicated low complexity (e.g., assembly line work) and 5 indicated high complexity (e.g., researchers and senior managers). The middle category, category 3, covers occupations of medium complexity. It includes 63% of all occupations in the US labor market (e.g., assistants, administrators, and monitoring of technical systems). Predictive power increases as the complexity of the work tasks increases. A study by Hunter, Schmidt, and Le in 2006 further confirms these relationships. By 2006, new statistical methods were available, e.g., meta-analysis.