Applied Psych Test Design: Part C - Use of Rasch scaling technology

The Art and Science of Applied Test Development. This is the third in a series of PPT modules explicating the development of psychological tests in the domain of cognitive ability using contemporary methods (e.g., theory-driven test specification; IRT-Rasch scaling; etc.). The presentations are intended to be conceptual and not statistical in nature. Feedback is appreciated.

    1. The Art and Science of Test Development—Part C. Test and item development: Use of Rasch scaling technology. Kevin S. McGrew, PhD, Educational Psychologist, Research Director, Woodcock-Muñoz Foundation. The basic structure and content of this presentation are grounded extensively in the test development procedures developed by Dr. Richard Woodcock.
    2. The Art and Science of Test Development. The above-titled topic is presented in a series of sequential PowerPoint modules. It is strongly recommended that the modules (A-G) be viewed in sequence.
       • Part A: Planning, development frameworks & domain/test specification blueprints
       • Part B: Test and item development
       • Part C: Use of Rasch technology
       • Part D: Develop norm (standardization) plan
       • Part E: Calculate norms and derived scores
       • Part F: Psychometric/technical and statistical analysis: Internal
       • Part G: Psychometric/technical and statistical analysis: External
       The current module is designated by red bold font lettering.
    3. Important note: For the on-line public versions of this PPT module, certain items, information, etc. are obscured for test security or proprietary reasons…sorry.
    4. Use Rasch (IRT) scaling to evaluate the complete pool of items and to develop the Norming and Publication tests.
    5. Structural (Internal) Stage of Test Development
       • Purpose: Examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities)
       • Questions asked: Do the observed measures “behave” in a manner consistent with the theoretical domain definition of intelligence?
       • Methods and concepts: Internal domain studies; item/subscale intercorrelations; item response theory (IRT)
       • Characteristics of a strong test validity program: Moderate item internal consistency; items/measures are representative of the empirical domain; items fit the theoretical structure
    6. Item Scale Development via Rasch technology. Theoretical domain = Cattell-Horn-Carroll (CHC) theory of cognitive abilities – the Gv domain and 3 selected narrow Gv abilities. Measurement or empirical domain: Rasch scale and evaluate the complete pool of items (from low ability/easy items to high ability/difficult items) to develop the Norming and Publication tests.
    7. Recall that Block Rotation items have 2 possible correct answers. Therefore there is a scoring question:
       • Should items be scaled as 0/1 (need both correct to receive 1)?
       • Should items be scaled as 0/1/2?
       Item data can be Rasch-scaled with both scoring systems and then select the one that provides the best reliability, etc. We decided to go with the 0/1/2 scoring system.
    8. Important understanding regarding 0/1 and multiple-point (0/1/2) scoring systems when using Rasch/IRT. Dichotomous (0/1) item scoring: 0 → 1 is one “step.” Multiple-point (0/1/2) item scoring: 0 → 1 → 2 is two “steps.” Therefore, think of 2-step items as two 0/1 items.
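    To make the “step” idea concrete, here is a minimal Python sketch (not output from any actual Rasch program) showing how a dichotomous 0/1 item and a 0/1/2 item with two steps can be expressed under the Rasch and partial credit models; the ability and step-difficulty values are hypothetical.

        import numpy as np

        def rasch_dichotomous(theta, b):
            # Probability of scoring 1 on a 0/1 item under the Rasch model
            return 1.0 / (1.0 + np.exp(-(theta - b)))

        def partial_credit(theta, step_difficulties):
            # Category probabilities (score 0, 1, 2, ...) for a multi-step item
            # under the Rasch partial credit model; each "step" has its own difficulty
            numerators = [1.0]
            for k in range(1, len(step_difficulties) + 1):
                numerators.append(np.exp(sum(theta - d for d in step_difficulties[:k])))
            numerators = np.array(numerators)
            return numerators / numerators.sum()

        # Hypothetical values: person ability 1.0 logit; step difficulties 0.5 and 1.5
        print(rasch_dichotomous(1.0, 0.5))       # P(correct) on a 0/1 item
        print(partial_credit(1.0, [0.5, 1.5]))   # P(score 0), P(score 1), P(score 2)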
    9. Rasch IRT “norms” (calibrates) the scale! Think of the items as now having been placed in their proper positions on an equal-interval ruler or yardstick…each item is a “tick” mark along the latent trait scale.
    10. A major advantage/feature of a large Rasch IRT-scaled item pool: once you have a large Rasch IRT-scaled item pool, you can develop different and customized scales that place people on the same underlying scale
       • CAT (computer adaptive testing) – a rough item-selection sketch follows below
       • Different and unique forms of the test
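    As one illustration of how a calibrated item pool supports CAT, the following hypothetical sketch picks the next item by maximizing Rasch item information at the current ability estimate; the pool difficulties and function names are invented for the example.

        import numpy as np

        def item_information(theta, b):
            # Fisher information of a Rasch item at ability theta: I = p * (1 - p)
            p = 1.0 / (1.0 + np.exp(-(theta - b)))
            return p * (1.0 - p)

        def next_cat_item(theta_hat, difficulties, administered):
            # Pick the unused pool item whose difficulty gives maximum information
            # at the current ability estimate (a basic CAT selection rule)
            info = [item_information(theta_hat, b) if i not in administered else -np.inf
                    for i, b in enumerate(difficulties)]
            return int(np.argmax(info))

        pool = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]   # hypothetical calibrated difficulties (logits)
        print(next_cat_item(theta_hat=0.3, difficulties=pool, administered={3}))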
    11. A major advantage/feature of a large IRT-scaled item pool: all three tests (Norming test, Publication test, and possible special/Research Edition tests) have items on the same scale (W-scale), ordered from easy to hard. Although each test has a different number of items, the obtained person ability W-scores are equivalent but differ in degree of precision (reliability). The average difference in “gaps” between items on the respective scales is called “item density.” The W-scale is an equal-interval metric.
    12. Two major Rasch results: people are assigned W-ability scores, and items are assigned W-difficulties. Rasch puts person ability and item difficulty on the same scale (the W scale).
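    A commonly cited description of the W scale is a linear transformation of the Rasch logit scale centered at 500, with roughly 9.10 W units per logit (20/ln 9). Assuming that form, a minimal conversion sketch might look like this (values hypothetical):

        import math

        W_CENTER = 500.0                    # W-scale reference point
        W_PER_LOGIT = 20.0 / math.log(9)    # about 9.1024 (assumed form of the transformation)

        def logit_to_w(logit):
            # Convert a Rasch logit (person ability or item difficulty) to the W scale
            return W_CENTER + W_PER_LOGIT * logit

        def w_to_logit(w):
            # Inverse: W-scale value back to logits
            return (w - W_CENTER) / W_PER_LOGIT

        print(logit_to_w(0.0))    # 500.0 at the scale center
        print(logit_to_w(2.0))    # about 518.2 (two logits above center)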
    13. Two major Rasch results: item W-difficulties and person W-ability scores. Select and order items for the Publication test based on inspection of the Rasch results. Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects) → Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects).
    14. Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects). Measure order and fit statistics table, used to select items with a specified item density.
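    A rough sketch of the idea behind selecting items with a specified item density: walk the measure-order list (item difficulties in W units) and keep items so that adjacent retained items are at least a target number of W units apart. The difficulties and target gap below are hypothetical; in practice this is done by inspecting the measure order table.

        def select_by_density(difficulties_w, target_gap):
            # Greedy pass over the ordered item difficulties, keeping an item only
            # if it is at least target_gap W units above the last item retained
            ordered = sorted(difficulties_w)
            selected = [ordered[0]]
            for d in ordered[1:]:
                if d - selected[-1] >= target_gap:
                    selected.append(d)
            return selected

        # Hypothetical item W-difficulties from a norming calibration
        pool = [455, 458, 463, 469, 470, 476, 483, 484, 492, 501, 510, 522]
        print(select_by_density(pool, target_gap=6))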
    15. Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects). Distribution of Block Rotation W-ability scores in the norm sample: the majority of the Block Rotation norm sample obtained W-scores from 480-520; the complete range (including extremes) of Block Rotation W-scores is 432-546.
    16. Recall the Block Rotation scoring system is 0/1/2—items have “steps.” Multiple-point (0/1/2) item scoring: 0 → 1 → 2, two “steps.”
    17. Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects). Item map with “steps” displayed for items (e.g., Item 1’s 0/1/2 step structure shown as two “steps”). The blue area represents the majority of norm sample subjects’ Block Rotation W-scores.
    18. Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects). Item map with “steps” displayed for items; the blue area represents the majority of norm sample subjects’ Block Rotation W-scores. Adequate “top” or “ceiling” for the test scale; excellent “bottom” or “floor” for the test scale. Very good test scale coverage for the majority of the population.
    19. Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects). Item map with “steps” displayed for items; the red area represents the complete range (including extremes) of sample Block Rotation W-scores. Good test scale coverage for the complete range of the population.
    20. Block Rotation Rasch floor/ceiling results confirmed by formal +/- 3 SD floor/ceiling analysis (24-300 months of age). [Figure: BLKROT floor (rs = 1) and ceiling (rs = max) plot; reference W +/- 3 SDs (W-scale roughly 430-550) plotted against age in months (camos).]
    21. Block Rotation Rasch floor/ceiling results confirmed by formal +/- 3 SD floor/ceiling analysis (300-1200 months of age). [Figure: BLKROT floor (rs = 1) and ceiling (rs = max) plot; reference W +/- 3 SDs (W-scale roughly 430-550) plotted against age in months (camos).]
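    The logic of the +/- 3 SD floor/ceiling check can be sketched as follows for a single age group: the W value earned at a raw score of 1 should fall at or below the reference W minus 3 SDs (adequate floor), and the W value at the maximum raw score should fall at or above the reference W plus 3 SDs (adequate ceiling). The reference mean and SD below are hypothetical; the floor/ceiling W values are taken from the scoring-table figures reported later in the deck.

        def floor_ceiling_ok(ref_w_mean, ref_w_sd, w_at_rs1, w_at_rsmax):
            # For one age group: does the test floor (W at raw score 1) and the
            # ceiling (W at maximum raw score) bracket the reference W +/- 3 SDs?
            floor_ok = w_at_rs1 <= ref_w_mean - 3 * ref_w_sd
            ceiling_ok = w_at_rsmax >= ref_w_mean + 3 * ref_w_sd
            return floor_ok, ceiling_ok

        # Hypothetical reference values for a single age group (in months)
        print(floor_ceiling_ok(ref_w_mean=500.0, ref_w_sd=12.0,
                               w_at_rs1=437.8, w_at_rsmax=545.7))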
    22. Two major Rasch results: item W-difficulties and person W-ability scores. The program generates the final raw-score-to-W-ability scoring table. Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects) → Block Rotation Publication test (n = 37 items; n = 4,722 norm subjects).
    23. Block Rotation: Final Rasch with norming test (n = 37 norming items; n = 4,722 norm subjects). Raw score to W-score “scoring table.” Note: The total raw score points is 74 for 37 items. These are 2-step items: 37 items × 2 steps = 74 total possible points.
    24. Block Rotation Norming test (n = 44 items): 44 items × 2 steps = raw scores from 0 to 88 on the Rasch-based scoring table (the equal-interval Visualization-Vz measurement “ruler” or “yardstick”). Raw score to W-score examples: raw score 88 = 545.7, raw score 87 = 539.0, …, raw score 1 = 437.8, raw score 0 = 431.6.
    25. The Block Rotation Norming and Publication tests, although having different numbers of items (and total raw scores), are on the same underlying measurement scale (ruler). Norming test (n = 44 items): raw score 88 = 545.7, 87 = 539.0, …, 1 = 437.8, 0 = 431.6. Publication test (n = 37 items): raw score 74 = 545.7, 73 = 539.0, …, 1 = 437.8, 0 = 431.6.
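    The raw-score-to-W scoring table itself is produced by the Rasch software, but the underlying idea can be sketched as follows: treat each 0/1/2 item as two dichotomous steps, and for each non-extreme raw score find the W ability at which the model-expected score equals that raw score. The step difficulties below are hypothetical, and the 20/ln 9 logit-to-W constant is the assumed form noted earlier.

        import math
        import numpy as np

        W_PER_LOGIT = 20.0 / math.log(9)    # assumed logit-to-W constant (see earlier sketch)

        def score_table(step_difficulties_w):
            # Treat each 0/1/2 item as two dichotomous steps (one difficulty per step,
            # in W units). For each non-extreme raw score, bisect on the W ability at
            # which the expected score equals that raw score. Extreme scores (0 and the
            # maximum) are handled separately (extrapolated) by Rasch software in practice.
            b = np.asarray(step_difficulties_w, dtype=float)
            table = {}
            for rs in range(1, len(b)):
                lo, hi = 300.0, 700.0                    # search bounds on the W scale
                for _ in range(60):
                    mid = (lo + hi) / 2.0
                    expected = np.sum(1.0 / (1.0 + np.exp(-(mid - b) / W_PER_LOGIT)))
                    lo, hi = (mid, hi) if expected < rs else (lo, mid)
                table[rs] = round((lo + hi) / 2.0, 1)
            return table

        # Hypothetical step difficulties (W units) for a short 3-item (6-step) test
        print(score_table([470, 480, 488, 495, 503, 512]))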
    26. Two major Rasch results: item W-difficulties and person W-ability scores. The program generates the final raw-score-to-W-ability scoring table. Result: all norm subjects with Block Rotation scores (n = 4,722) now have scores on the equal-interval W-score scale. Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects) → Block Rotation Publication test (n = 37 items).
    27. Two major Rasch results: item W-difficulties and person W-ability scores. The program generates the final raw-score-to-W-ability scoring table. Result: all norm subjects with Block Rotation scores (n = 4,722) now have scores on the equal-interval W-score scale. These Block Rotation W-scores are then used for developing test “norms” and for completing technical manual analyses and validity research. Block Rotation Norming test (n = 44 items; n = 4,722 norm subjects) → Block Rotation Publication test (n = 37 items).
    28. Block Rotation summary: Final Rasch for the Publication test – graphic item map (n = 37 norming items, 0-74 raw score points; n = 4,722 norm subjects). Graphic display of the distribution of Block Rotation person abilities on the Publication test W-score scale (432-546). These Block Rotation W-scores are then used for developing test “norms” and validity research.
    29. Recall the early warning to expect the unexpected and the non-linear “art and science” of test development. A last-minute question was raised (prior to formal production) about the Block Rotation test: Should the blocks be shaded/colored instead of black and white? Would adding shading/color change the nature of the task? What to do? Answer: Do a study—gather some empirical data to help make the decision. The question should be answered empirically – you should not assume that colorizing items will make no difference.
    30. Special Block Rotation no-color vs. color group administration study completed.
    31. Special Block Rotation no-color vs. color group administration study completed.
       • Sample size plan – approx. 300+ subjects across 3 groups spanning the complete range of Block Rotation ability: 2nd-4th graders (approx. 100+), 7th-11th graders (approx. 100+), college students (approx. 100+). Final total sample was 380 subjects.
       • Group administration version of the test.
       • Two forms of the test constructed from the complete set of ordered (scaled) items: white version – even items; colored version – odd items.
       • Analyses – Rasch analysis and comparison of the respective item difficulties, and mean score comparison between versions (see the sketch below).
       • Conclusion – adding color did NOT change the psychometric characteristics of the items/test; therefore, print the final test with colored items.
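    One plausible way to run the item-difficulty comparison from such a study (not necessarily the exact analysis used here) is to correlate the item difficulties obtained under the two administrations and examine the mean and maximum differences; the difficulty values below are hypothetical.

        import numpy as np

        def compare_calibrations(difficulties_a, difficulties_b):
            # Compare item difficulties (W units) for the same items calibrated from
            # two administrations: correlation, mean shift, and largest single change
            a = np.asarray(difficulties_a, dtype=float)
            b = np.asarray(difficulties_b, dtype=float)
            return {"correlation": round(float(np.corrcoef(a, b)[0, 1]), 3),
                    "mean_difference": round(float(np.mean(b - a)), 2),
                    "max_abs_difference": round(float(np.max(np.abs(b - a))), 2)}

        # Hypothetical difficulties for five items: no-color vs. color administration
        print(compare_calibrations([465, 478, 490, 502, 515],
                                   [466, 476, 491, 503, 514]))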
    32. Final Block Rotation Publication test constructed: n = 37 (0/1/2) items—raw scores from 0-74. Two sample items.
    33. Rasch (IRT) is a magnificent tool for evaluating and constructing tests, with flexibility during the entire process. Embrace IRT methods in applied test development (vs. CTT methods).
       • Important to remember: you are calibrating the scale and not norming the test during this phase. Samples with rectangular distributions of ability are critical.
       • Carefully inspect the Rasch results (especially the measure order table) and determine whether you have enough easy and difficult items or need more items at certain places along the scale. Then use “linking/anchor” technology to add in new items.
       • Item fit is a relative matter involving “reasonably acceptable approximate fit.” Don’t blindly follow black-and-white item fit rules from textbooks and articles (a fit-statistic sketch follows below). The “real world” of test development is not an ivory tower exercise.
       • Follow the 3 basic Rasch assumptions (unidimensionality; equal discrimination; local independence) “within reason” (Woodcock).
       • Many tests claim to use the Rasch model (Rasch “name dropping”) but only use it for item analyses and do not harness the advantages of the underlying Rasch ability scale (e.g., W-scale) for improved test construction and score interpretation procedures (e.g., RPIs).
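    For readers who want to see what the fit statistics behind “reasonably acceptable approximate fit” look like, here is a minimal sketch of infit and outfit mean squares for one dichotomous item; the person abilities, responses, and item difficulty are hypothetical. Values near 1.0 indicate reasonable fit, but, as noted above, cutoffs should not be applied mechanically.

        import numpy as np

        def rasch_fit_stats(responses, theta, b):
            # Infit and outfit mean squares for one dichotomous item.
            # responses: 0/1 vector over persons; theta: person abilities (logits);
            # b: the item's difficulty (logit)
            p = 1.0 / (1.0 + np.exp(-(theta - b)))
            var = p * (1.0 - p)                       # model variance per response
            resid_sq = (responses - p) ** 2
            outfit = np.mean(resid_sq / var)          # unweighted (outlier-sensitive)
            infit = np.sum(resid_sq) / np.sum(var)    # information-weighted
            return round(float(infit), 2), round(float(outfit), 2)

        # Hypothetical data: 8 persons answering one item of difficulty 0.2 logits
        theta = np.array([-1.5, -0.5, 0.0, 0.3, 0.8, 1.2, 1.8, 2.5])
        x = np.array([0, 0, 1, 0, 1, 1, 1, 1])
        print(rasch_fit_stats(x, theta, b=0.2))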
    34. Maintaining a master item pool:
       • Norming-calibration tests
       • Linking/equating (alternate forms) tests
       • Adding new items to the master item pool (use of anchor items from the master item pool – see the linking sketch below)
       • Checking for possible item bias (DIF – differential item functioning)
       • Creating and using shortened special-purpose versions of tests (norming tests; research edition tests; tests for special populations)
       • Flagging potentially poor examiners via empirical “person fit” statistics reports
       • Computer adaptive testing (CAT)
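    A very simple sketch of anchor-item linking (one way new items can be added to a master pool; real projects may use more elaborate equating): calibrate the new items together with anchor items from the master pool, compute the mean shift between the anchors’ master-scale and new-run difficulties, and apply that shift to the new items. All values below are hypothetical.

        import numpy as np

        def link_to_master_scale(anchor_master_w, anchor_new_w, new_items_new_w):
            # Mean-shift linking: the average difference between the anchors' master-pool
            # difficulties and their new-run difficulties is applied to the new items,
            # placing them on the master pool's W scale
            shift = np.mean(np.asarray(anchor_master_w) - np.asarray(anchor_new_w))
            return [round(float(w + shift), 1) for w in new_items_new_w]

        # Hypothetical: three anchor items and two brand-new items from a new calibration
        print(link_to_master_scale(anchor_master_w=[470.0, 495.0, 520.0],
                                   anchor_new_w=[468.0, 492.5, 518.5],
                                   new_items_new_w=[480.0, 507.0]))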
    35. End of Part C. Additional steps in the test development process will be presented in subsequent modules as they are developed.
