Applied Psych Test Design Part F: Psychometric/technical statistical analysis: Internal

NOTE (slide notes): Reliability statistics were calculated for all WJ III tests across their range of intended use, including all norming subjects tested at each technical age level (25 levels: by year from 2-19, by 10-year intervals from 20-79, and 80+).
  • Split-half coefficients were corrected for length using the Spearman-Brown correction.
  • Rasch analysis procedures were used to calculate reliability for 8 multi-point tests (4 in COG, 4 in ACH) and 8 speeded tests (5 in COG, 3 in ACH).
  • A test-retest study was also conducted for the 8 speeded tests, with the interval set at 1 day to minimize changes in test scores due to changes in the subjects' states or traits. These test-retest reliabilities are lower than those obtained for the same tests using Rasch analysis procedures; test-retest is a lower bound for speeded tests, while the other reliabilities are an upper bound.
  • TESTS: 38 of 42 are .80 or higher (15 are .90 or higher; 4 are in the .70s). CLUSTERS: 36 of 42 are .90 or higher (6 are in the .80s).
  • Consult Chapter 3 of the Technical Manual for more information; summary information is in ASB 2.
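The Spearman-Brown correction mentioned in the note projects full-length reliability from a half-test correlation. A minimal sketch in Python (illustrative only, not code from the WJ III project):

```python
def spearman_brown(r, n=2.0):
    """Spearman-Brown prophecy formula: projected reliability when a
    test is lengthened by a factor of n. With n=2, this corrects a
    split-half coefficient up to full-test length."""
    return (n * r) / (1.0 + (n - 1.0) * r)
```

For example, a split-half correlation of .80 corrects to roughly .89 at full length; a perfectly reliable half (r = 1.0) stays at 1.0.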
    1. The Art and Science of Test Development—Part F: Psychometric/technical statistical analysis: Internal. The basic structure and content of this presentation is grounded extensively in the test development procedures developed by Dr. Richard Woodcock. Kevin S. McGrew, PhD, Educational Psychologist; Research Director, Woodcock-Muñoz Foundation.
    2. "In God we trust... all others must show data" (unknown source). Test authors and publishers have a standards-based responsibility to provide supporting psychometric technical information about their tests and batteries, typically in the form of a series of technical chapters in the manual or a separate technical manual.
    3. Calculate psychometric/measurement statistics for the technical manual/chapters. Use the Joint Test Standards as a guide.
    4. Calculate summary statistics (n, means, SDs, SEMs) and reliabilities for all tests and clusters by technical age groups, etc.
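The SEM reported alongside the summary statistics follows directly from the score SD and the reliability coefficient. A minimal sketch of the standard formula (illustrative, not the WJ III implementation):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r_xx),
    i.e., the expected spread of observed scores around true scores."""
    return sd * math.sqrt(1.0 - reliability)
```

On a standard-score metric (SD = 15), a test with reliability .91 has an SEM of about 4.5 points.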
    5. Special reliability analyses are required for speeded tests: traditional test-retest reliability analysis.
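Traditional test-retest reliability is simply the Pearson correlation between scores from the two administrations. A dependency-free sketch (illustrative only):

```python
def pearson_r(x, y):
    """Pearson correlation between time-1 scores x and time-2 scores y,
    used here as a test-retest (stability) coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

Perfectly consistent rank ordering across the two occasions yields r = 1.0, regardless of practice-effect shifts in the mean.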
    6. Special reliability analyses for all tests: more complex repeated measures reliability analysis (McArdle & Woodcock, 1989; see the WJ-R Technical Manual).
    7. Provide evidence based on internal structure (internal validity).
    8. Structural (Internal) Stage of Test Development.
       Purpose: examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities).
       Questions asked: do the observed measures "behave" in a manner consistent with the theoretical domain definition of intelligence?
       Methods and concepts:
       - Internal domain studies
       - Item/subscale intercorrelations
       - Exploratory/confirmatory factor analysis
       Characteristics of a strong test validity program:
       - Measures co-vary in a manner consistent with the intended theoretical structure
       - Factors reflect trait rather than method variance
       - Items/measures are representative of the empirical domain
    9. Structural/internal validity evidence: test and cluster intercorrelation matrices by technical age groups, etc.
    10. Structural/internal validity: confirmatory factor analysis by major age groups (exploratory factor analysis if the test blueprint is not theory-driven).
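A full CFA requires a SEM package, but a first-pass exploratory extraction of factor loadings from a test intercorrelation matrix can be sketched with an eigendecomposition (principal-component loadings). This is illustrative of the exploratory route mentioned above, not the analysis used for the WJ III:

```python
import numpy as np

def principal_loadings(R, n_factors=1):
    """First-pass exploratory loadings: eigenvectors of the test
    intercorrelation matrix R, scaled by the square roots of their
    eigenvalues, for the n_factors largest components."""
    eigvals, eigvecs = np.linalg.eigh(R)          # ascending order
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
```

For two tests correlating .60, each loads about .89 on the single shared component (sign is arbitrary, as usual with eigenvectors).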
    11. Structural/internal validity: confirmatory factor analysis by major age groups (exploratory factor analysis if the test blueprint is not theory-driven). [Figure: path diagram with factor loadings of .67, .53, .40, .42, and .43.]
    13. Structural (Internal) Stage of Test Development.
        Purpose: examine the internal relations among the measures used to operationalize the theoretical construct domain (i.e., intelligence or cognitive abilities).
        Questions asked: do the observed measures "behave" in a manner consistent with the theoretical domain definition of intelligence?
        Methods and concepts:
        - Exploratory/confirmatory factor analysis
        Characteristics of a strong test validity program:
        - The theoretical/empirical model is deemed plausible (especially when compared against other competing models) based on substantive and statistical criteria
    14. Structural/internal validity: confirmatory factor analysis model comparisons by major age groups. The WJ III factor structure model provided the best fit to the data when compared against six alternative models, and the conclusion was the same across 5 age-differentiated samples.
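Competing models like those above are often ranked with information criteria. One common convention in the SEM literature computes AIC as the model chi-square plus twice the number of free parameters; a sketch under that assumption (the slides do not state which fit indices were used for the WJ III comparisons):

```python
def aic(chi_square, free_params):
    """AIC for a fitted covariance-structure model, using the common
    SEM convention: model chi-square + 2 per estimated parameter.
    Lower values indicate the preferred model; the penalty guards
    against winning fit purely by adding parameters."""
    return chi_square + 2.0 * free_params
```

A model with chi-square 80 and 20 parameters (AIC = 120) would be preferred over one with chi-square 130 and 5 parameters (AIC = 140), despite the latter's parsimony.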
    15. Internal validity evidence example: g-loadings for the differentially weighted General Intellectual Ability cluster.
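The idea of a differentially weighted cluster is that each test contributes in proportion to its g-loading rather than equally. A minimal sketch of one such weighting scheme (hypothetical and illustrative, not the published WJ III GIA algorithm):

```python
def weighted_composite(scores, g_loadings):
    """Differentially weighted composite: each test score is weighted
    by its g-loading, with weights normalized to sum to 1.
    (Illustrative scheme only, not the WJ III procedure.)"""
    total = sum(g_loadings)
    return sum(s * w / total for s, w in zip(scores, g_loadings))
```

With loadings of .75 and .25, scores of 110 and 90 yield a composite of 105, pulled toward the more g-loaded test rather than the simple mean of 100.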
    16. Provide evidence based on internal structure: developmental evidence?
    17. Developmental evidence in the form of differential growth curves of measures.
    18. Provide test fairness evidence.
    19. Structural/internal validity: evaluating structural invariance with multiple group CFA (White vs. Non-White). [Figure: factor models compared across the two groups.]
    20. Structural/internal validity: evaluating structural invariance with multiple group CFA (Male vs. Female). [Figure: factor models compared across the two groups.]
    21. Structural/internal validity: evaluating structural invariance with multiple group CFA (Hispanic vs. Non-Hispanic). [Figure: factor models compared across the two groups.]
    22. Test fairness evidence, item-level analyses: differential item functioning (DIF) for Male/Female, White/Non-White, and Hispanic/Non-Hispanic contrasts.
    23. Test fairness evidence, item-level analyses: differential item functioning (DIF) for Male/Female, White/Non-White, and Hispanic/Non-Hispanic contrasts, with the results combined with those from Bias Sensitivity Review Panels.
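The Mantel-Haenszel statistic is a standard item-level DIF method for contrasts like these (the slides do not state which DIF procedure the WJ III used). A sketch of the ETS delta-scale index computed from ability-matched 2x2 tables:

```python
import math

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel DIF index from ability-matched 2x2 tables.
    Each stratum is a tuple (ref_correct, ref_wrong, focal_correct,
    focal_wrong) for examinees matched on total score. Returns the
    ETS delta-scale value: 0 means no DIF; negative values indicate
    the item favors the reference group."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den                      # common odds ratio
    return -2.35 * math.log(alpha)         # ETS delta transformation
```

When matched reference and focal examinees pass the item at the same rate, the index is 0; an item that is easier for the reference group at every ability level produces a negative value.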
    24.
        - Lack of rigor and quality control in all prior/earlier stages will "rattle through the data" and rear its ugly head when performing the final statistical analysis.
        - Short cuts in prior stages will "bite you in the ____" as you attempt to perform the final statistical analysis.
        - Data screening, data screening, data screening prior to performing the final statistical analysis:
          - Compute extensive descriptive statistical analyses for all variables (e.g., histograms, scatterplots, box-whisker plots, etc.).
          - Go beyond means and SDs: also calculate the median, skew, kurtosis, n-tiles, etc.
        - Deliberately planned and sophisticated "front end" data collection short cuts (e.g., matrix sampling) introduce an extreme level of "back end" complexity to routine statistical/psychometric analysis.
        - Know your limits, level of expertise, and skills. Even those with extensive test development experience often need access to trusted measurement/statistical consultants.
        (cont. next slide)
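The screening pass described above (beyond means and SDs) can be scripted once and run over every variable. A minimal dependency-free sketch using population formulas (illustrative only):

```python
def screen(values):
    """Descriptive screening statistics beyond the mean and SD:
    median, skewness, and excess kurtosis (population formulas).
    A skew near 0 and excess kurtosis near 0 suggest approximate
    normality; large values flag variables needing closer inspection."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    ordered = sorted(values)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * sd ** 4) - 3
    return {"mean": mean, "sd": sd, "median": median,
            "skew": skew, "kurtosis": kurt}
```

A perfectly symmetric set of scores, for example, comes back with skewness of 0 and median equal to the mean.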
    25.
        - Don't be seduced by, or completely reliant on, factor analysis as the primary internal/structural validity tool.
          - An example: the inability of CFA to differentiate closely related latent constructs (e.g., Gc and reading/writing, Grw) doesn't prove they are the same; examine other evidence (e.g., the very different developmental growth curves for Gc and Grw).
        - Published statistical/psychometric information needs to be based on final publication-length tests.
          - Test-length correction formulas (e.g., KR-21) are often needed for test reliabilities.
          - Correlations between short and/or long norming versions of a test that differ in length (number of items) from the publication-length test may need special adjustments/corrections.
        - Back up, back up, back up! Don't let a dead hard drive or computer destroy your work and progress. Do it constantly, and build redundancy into your files and people skill sets.
        - Sad fact: the majority of test users do NOT pay attention to the fancy and special psychometric/statistical analyses you report in technical chapters or manuals. Be prepared for post-publication education via other methods.
        - Post-manual-publication technical reports of special/sophisticated analyses are a good option when publication time-line pressures dictate making difficult decisions.
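KR-21, mentioned above, estimates the reliability of a k-item dichotomously scored test from only the total-score mean and variance. Its standard form, as a sketch:

```python
def kr21(k, mean, variance):
    """Kuder-Richardson formula 21: reliability estimate for a test of
    k dichotomous items, assuming items of roughly equal difficulty,
    from the total-score mean and variance."""
    return (k / (k - 1)) * (1.0 - mean * (k - mean) / (k * variance))
```

For a 20-item test with a mean of 10 and a variance of 9, KR-21 gives roughly .47, flagging a test whose score variance is too small relative to its length for strong internal consistency.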
    26.
        - Exploratory-driven confirmatory factor analysis is often used by test developers to explore unexpected characteristics of tests (often called "model generation modeling" in the SEM/CFA literature).
        - Different approaches to DIF (differential item functioning).
        - Multiple group CFA to test invariance (by age, by gender, by ...). Different degrees of measurement invariance can be tested.
        - The traditional definition of psychometric bias and appropriate/inappropriate statistical methods.
        - Equating (e.g., Form A/B) methods and evidence.
        - Methods for calculating prediction models that account for regression to the mean and that are sensitive to developmental (age) x content interactions.
        - Complex repeated measures reliability analyses to tease out test stability, internal consistency, and trait stability sources of score variance (see the WJ-R Technical Manual).
    27. End of Part F. Additional steps in the test development process will be presented in subsequent modules as they are developed.
