Introduction to Item Response Theory

  1. Day 1 AM: An Introduction to Item Response Theory Nathan A. Thompson Vice President, Assessment Systems Corporation Adjunct faculty, University of Cincinnati nthompson@assess.com
  2. Welcome!  Thank you for attending!  Introductions and important info now  Software… download or USB  Please ask questions ◦ Also, slow me down or ask for translation!  Goal: provide an intro on IRT/CAT to those who are new ◦ For those with some experience, to provide new viewpoints and more resources/recommendations
  3. Where I’m from, professionally  PhD, University of Minnesota ◦ CAT for classifications  Test development manager for ophthalmology certifications  Psychometrician at Prometric (many certifications)  VP at ASC
  4. Where I’m from, geographically
  5. Except now things look like…
  6. We do odd things in winter
  7. Introduce yourselves  Name  Employer/organization  Types of tests you do and/or why you are interested in IRT/CAT  (There might be someone with similar interests here)
  8. Another announcement  Newly formed: International Association for Computerized Adaptive Testing (IACAT) ◦ www.iacat.org ◦ Free membership ◦ Growing resources ◦ Next conference: August 2012, Sydney
  9. Welcome!  This workshop is on two highly related topics: IRT and CAT  IRT is the modern paradigm for developing, analyzing, scoring, and linking tests  CAT is a next-generation method of delivering tests  CAT is not feasible without IRT, so we do IRT first
  10. IRT – where are we going?  IRT, as many of you know, provides a way of analyzing items  However, it has drawbacks (no distractor analysis), so the main reasons to use IRT are at the test level  It solves certain issues with classical test theory (CTT)  But the two should always be used together
  11. IRT – where are we going?  Advantages ◦ Better error characterization ◦ More precise scores ◦ Better linking ◦ Model-based ◦ Items and people on same scale (CAT) ◦ Sample-independence ◦ Powerful test assembly
  12. IRT – where are we going?  Keyword: paradigm or approach ◦ Not just another statistical analysis ◦ It is a different way of thinking about how tests should work, and how we can approach specific problems (scaling, equating, test assembly) from that viewpoint
  13. Day 1  There will be four parts this morning, covering the theory behind IRT: ◦ Rationale: A graphical introduction to IRT ◦ Models (dichotomous and polytomous) and their response functions ◦ IRT scoring (θ estimation) ◦ Item parameter estimation and model fit
  14. Part 1 A graphical introduction to IRT
  15. What is IRT?  Basic Assumptions 1. Unidimensionality  A unidimensional latent trait (1 at a time)  Item responses are independent of each other (local independence), except for the trait/ability that they measure 2. A specific form of the relationship between trait level and probability of a response  The response function, or IRT model  There are a growing number of models
  16. What is IRT?  A theory of mathematical functions that model the responses of examinees to test items/questions  These functions are item response functions (IRFs)  Historically, it has also been known as latent trait theory and item characteristic curve theory  The IRFs are best described by showing how the concept is derived from classical analysis…
  17. Classical item statistics  CTT statistics are typically calculated for each option:

      Option     N    Prop    Rpbis     Mean
      1        307   0.860    0.221   91.876
      2         25   0.070   -0.142   85.600
      3         14   0.039   -0.137   83.929
      4         11   0.031   -0.081   86.273
  18. Classical item statistics  The proportions are often translated to a figure like this, where examinees are split into groups
  19. Classical item statistics  The general idea of IRT is to split the previous graph up into more groups, and then find a mathematical model for the blue line  This is what makes the item response function (IRF)
  20. Classical item statistics  Example with 10 groups
  21. The item response function  Reflects the probability of a given response as a function of the latent trait (z-score)  Example:
  22. The IRF  For dichotomously scored items, it is the probability of a correct or keyed response  Also called Item Characteristic Curve (ICC) or Trace Line  Only one curve (correct response), and all other responses are grouped as (1-IRF)  For polytomous items (partial credit, etc.), it is the probability of each response
  23. The IRF  How do we know exactly what the IRF for an item is?  We estimate parameters for an equation that draws the curve  For dichotomous IRT, there are three relevant parameters: a, b, and c
  24. The IRF  a: The discrimination parameter; represents how well the item differentiates examinees; slope of the curve at its center  b: The difficulty parameter; represents how easy or hard the item is with respect to examinees; location of the curve (left to right)  c: The pseudoguessing parameter; represents the ‘base probability’ of answering the question; lower asymptote
  25. The IRF  a=1, b=0, c=0.25
  26. The IRF…  is the “basic building block” of IRT  will differ from item to item  can be one of several different models (now)  can be used to evaluate items (now)  is used for IRT scoring (next)  leads to “information” used for test design (after that)  is the basis of CAT (tomorrow)
  27. Part 2 IRT models
  28. IRT models  Several families of models ◦ Dichotomous ◦ Polytomous ◦ Multidimensional ◦ Facets (scenarios vs raters) ◦ Mixed (additional parameters) ◦ Cognitive diagnostic ◦ We will focus on first two
  29. Dichotomous IRT models  There are 3 main models in use, as mentioned earlier: 1PL, 2PL, 3PL  The “L” refers to “logistic,” the type of equation used  IRT was originally developed decades ago with a cumulative normal curve (the normal ogive)  That formulation required calculus to compute probabilities
  30. The logistic function  An approximation was developed: the logistic curve  No calculus needed  There are two formats, based on the scaling constant D  If D = 1.702, the difference from the normal curve is less than 0.01  If D = 1.0, there is a little more difference; this is called the true logistic form  It does not really matter which you use, as long as you are consistent
  31. The logistic function  The basic form of the curve
  32. Item parameters  We add parameters to slightly modify the shape to get it to match our data  For example, a 4-option multiple choice item has a 25% chance of being guessed correctly  So we add a c parameter as a lower asymptote, which means that the curve is “squished” so it never goes below 0.25 (next)
  33. Item parameters  Sample IRF to show c
  34. Item parameters  We can also add a parameter (a) that modifies the slope  And a b parameter that slides the entire curve left or right ◦ This tells us the person z-score for which the item is most appropriate  Items can be evaluated based on these just like with CTT statistics  A little more next…
  35. Item parameters: a  The a parameter ranges from 0.0 to about 2.0 in practice (theoretically to infinity)  Higher means better discrimination  For achievement testing, 0.7 or 0.8 is good; for aptitude testing, values run higher  Helps you: remove items with a < 0.4? Identify items with a > 1.0 as great items?
  36. Item parameters: b  For what person z-score is the item appropriate? (non-Rasch)  Should be between -3 and 3 ◦ 99.9% of students are in that range  0.0 is an average person  1.0 is difficult (85th percentile)  -1.0 is easy (15th percentile)  2.0 is super difficult (98th percentile)  -2.0 is super easy (2nd percentile)
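Those percentile figures follow from treating θ as a standard normal z-score; a quick sketch (not part of the slides) that checks them with the normal CDF:

```python
from scipy.stats import norm

# Percent of examinees falling below a given difficulty b on the theta (z) scale,
# assuming theta is distributed standard normal.
for b in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"b = {b:+.1f} -> {norm.cdf(b):.1%} of examinees below")
```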
  37. Item parameters: b  If item difficulties are normally distributed, where does this fall? (Rasch)  0.0 is average item (NOT PERSON)
  38. Item parameters: c  The c parameter should be about 1/k, where k is the number of options  If higher, this indicates that options are not attractive  For example, suppose c = 0.5  This means there is a 50/50 chance  That implies that even the lowest students are able to ignore two options and guess between the other two options
  39. Item parameters  Extreme example: ◦ What is 23+25?  A. 48  B. 47  C. 3.14159…  D. 1,256,457
  40. The (3PL) logistic function  Here is the equation for the 3PL, so you can see where the parameters are inserted  Item i, person j  Equivalent formulations can be seen in the literature, like moving the (1 - c) above the line:

      P(X_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta_j - b_i)}}
  41. The (3PL) logistic function  ai is the item discrimination parameter for item i,  bi is the item difficulty or location parameter for item i,  ci is the lower asymptote, or pseudoguessing parameter for item i,  D is the scaling constant equal to 1.702 or 1.0.
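As a concrete illustration (a minimal sketch, not from the original slides), the 3PL can be written directly in code; the 2PL is the special case c = 0, and the 1PL additionally holds a constant across items:

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 7)
# The a = 1, b = 0, c = 0.25 item shown earlier in the slides
print(irf_3pl(theta, a=1.0, b=0.0, c=0.25))
```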
  42. The (3PL) logistic function  The P is due primarily to (θ - b)  The effect due to a and c is not as strong  That is, your probability of getting the item correct is mostly due to whether it is easy/difficult for you ◦ This leads to the idea of adaptive testing
  43. 3PL  IRT has 3 dichotomous models  I’ll now go through the models with more detail, from 3PL down to 1PL  The 3PL is appropriate for knowledge or ability testing, where guessing is relevant  Each item will have an a, b, and c parameter
  44. IRT models  Three 3PL IRFs: c = 0, 0.1, 0.2 (b = -1, 0, 1; a = 1, 1, 1)  [Figure: probability vs. theta]
  45. 2PL  The 2PL assumes that there is no guessing (c = 0.0)  Items can still differ in discrimination  This is appropriate for attitude or psychological type data with dichotomous responses ◦ I like recess time at school (T/F) ◦ My favorite subject is math (T/F)
  46. IRT models  Three 2PL IRFs: a = 0.75, 1.5, 0.3; b = -1.0, 0.0, 1.0  [Figure: probability vs. theta]
  47. 1PL  The 1PL assumes that all items are of equal discrimination  Items only differ in terms of difficulty  The raw score is now a sufficient statistic for the IRT score  Not the case with 2PL or 3PL; it’s not just how many items you get right, but which ones  10 hard items vs. 10 easy items
  48. 1PL  The 1PL is also appropriate for attitude or psychological type data, but where there is no reason to believe items differ substantially in terms of discrimination  This is rarely the case  Still used: see Rasch discussion later
  49. 1PL  Three 1PL IRFs: b = -1, 0, 1  [Figure: probability vs. theta]
  50. How to choose?  Characteristics of the items  Check with the data! (fit)  Sample size: ◦ 1PL = 100 minimum ◦ 2PL = 300 minimum ◦ 3PL = 500 minimum  Score report considerations (sufficient statistics)
  51. The Rasch Perspective  Another consideration in model choice  There is a group of psychometricians (mostly from Australia and Chicago) who believe that the 1PL is THE model  Everything else is just noise  Data should be “cleaned” to reflect this
  52. The Rasch Perspective  How to clean? A big target is to eliminate guessing  But how do you know?  Slumdog Millionaire Effect
  53. The Rasch Perspective  This group is very strong in their belief  Why? They believe it is “objective” measurement  Score scale centered on items, not people, so “person-free”  Software and journals devoted just to the Rasch idea
  54. The Rasch Perspective  Should you use it?  I was trained to never use Rasch ◦ Equal discrimination assumption is completely unrealistic… we all know some items are better than others ◦ We all know guessing should not be ignored ◦ Data should probably not be doctored ◦ Instead, data should drive the model
  55. The Rasch Perspective  However, while some researchers hate the Rasch model, I don’t ◦ It is very simple ◦ It works better with tiny samples ◦ It is easier to describe ◦ Score reports and sufficient statistics ◦ Discussion points from you? ◦ Nevertheless, I recommend IRT
  56. Polytomous models  Polytomous models are for items that are not scored correct/incorrect, yes/no, etc.  Two types: ◦ Rating scale or Likert: “Rate on a scale of 1 to 5” ◦ Partial credit – very useful in constructed-response educational items  My experience as a scorer
  57. Polytomous models  Partial credit example with rubric: ◦ Open-response question: “2 + 3(4 + 5) =”  0: no answer  1: 2, 3, 4, or 5 (picks one number)  2: 14 (adds them all)  3: 45 (does (2 + 3) x (4 + 5))  4: 27 (everything but adding the 2)  5: 29 (correct)
  58. The IRF  Polytomous example (CRFs):
  59. Comparison table

      Model   Item Disc.               Step Spacing   Step Ordering   Option Disc.
      RSM     Fixed                    Fixed          Fixed           Fixed
      PCM     Fixed                    Variable       Variable        Fixed
      GRSM    Variable                 Fixed          Fixed           Fixed
      GRM     Variable                 Variable       Fixed           Fixed
      GPCM    Variable                 Variable       Variable        Fixed
      NRM     Variable (each option)   Variable       Variable        Variable

      (Fixed/Variable between items… more later, if time)
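As a worked polytomous example (a sketch using the graded response model's standard form, which the slides do not spell out, with hypothetical parameters), category response functions can be built from cumulative 2PL-style curves:

```python
import numpy as np

def grm_crf(theta, a, b_steps, D=1.702):
    """Category response functions for one graded-response-model item.

    b_steps are the ordered step (threshold) parameters; the return value has
    one probability per response category, summing to 1 at each theta.
    """
    theta = np.atleast_1d(theta)
    # Cumulative curves P*(X >= k), bracketed by 1 (lowest) and 0 (beyond highest)
    cum = [np.ones_like(theta)]
    for b in b_steps:
        cum.append(1.0 / (1.0 + np.exp(-D * a * (theta - b))))
    cum.append(np.zeros_like(theta))
    cum = np.vstack(cum)
    return (cum[:-1] - cum[1:]).T   # P(X = k) = P*(X >= k) - P*(X >= k+1)

print(grm_crf(0.0, a=1.2, b_steps=[-1.0, 0.0, 1.5]))   # hypothetical 4-category item
```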
  60. Part 3 Ability (θ) estimation (IRT scoring)
  61. Scoring  First: throw out your idea of a “score” as the number of items correct  We actually want something more accurate: a precise z-score  Because the z-score axis is called θ in IRT, scoring is called θ estimation
  62. Scoring  IRT utilizes the IRFs in scoring examinees  If an examinee gets a question right, they “get” the item’s IRF  If they get the question wrong, they “get” the (1-IRF)  These curves are multiplied for all items to get a final curve called the likelihood function
  63. Scoring  Here’s an example IRF; a =1, b=0, c = 0
  64. Scoring  A “1-IRF”
  65. Scoring  We multiply those to get a curve like this…
  66. Scoring - MLE  The score is the point on the x-axis where the likelihood is highest  This is the maximum likelihood estimate  In the example, 0.0 (average ability)  This obtains precise estimates on the θ scale
  67. Maximum likelihood  The LF is technically defined as

      L(u_j \mid \theta_j) = \prod_{i=1}^{n} P_{ij}^{\,u_{ij}} \, Q_{ij}^{\,1 - u_{ij}}

  where u_j is a response vector of 1s and 0s and Q_{ij} = 1 - P_{ij}  Note what this does to the exponents
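A minimal sketch of this scoring approach (hypothetical item parameters, with a brute-force grid in place of whatever search a real program uses): multiply the IRF for each correct response and (1 - IRF) for each incorrect response, then take the θ with the highest product.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.702):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Hypothetical 5-item test: columns are a, b, c
items = np.array([[1.0, -1.0, 0.2],
                  [0.8,  0.0, 0.2],
                  [1.2,  0.5, 0.2],
                  [0.9,  1.0, 0.2],
                  [1.1, -0.5, 0.2]])
responses = np.array([1, 1, 0, 0, 1])        # u_i: 1 = correct, 0 = incorrect

grid = np.linspace(-4, 4, 801)               # candidate theta values
P = irf_3pl(grid[:, None], items[:, 0], items[:, 1], items[:, 2])
likelihood = np.prod(np.where(responses == 1, P, 1.0 - P), axis=1)

theta_mle = grid[np.argmax(likelihood)]      # the maximum likelihood estimate
print(f"MLE of theta: {theta_mle:.2f}")
```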
  68. Scoring - SEM  A quantification of just how precise the estimate is can also be calculated, called the standard error of measurement  This is assumed to be the same for everyone in classical test theory, but in IRT it depends on the items, the responses, and the level of θ
  69. Scoring - SEM  Here’s a new LF – blue has the same MLE but is less spread out  Both are based on two items; the blue one has a = 2
  70. Scoring - SEM  The first LF had an SEM ~ 1.0  The second LF had an SEM ~ 0.5  We have more certainty about the second person’s score  This shows how much high-quality items aid in measurement ◦ Same items and responses, except a higher a
  71. Scoring - SEM  SEM is usually used to stop CATs  General interpretation: confidence interval  Plus or minus 1.96 (about 2) is 95%  So if the SEM in the example is 0.5, we are 95% sure that the student’s true ability is somewhere between -1.0 and +1.0
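A sketch of how that interval could be computed (using the standard 3PL item-information formula and the hypothetical items from the earlier sketch; the SEM is one over the square root of the test information at the estimate):

```python
import numpy as np

def item_info_3pl(theta, a, b, c, D=1.702):
    """Item information for the 3PL; with c = 0 this reduces to the 2PL case."""
    P = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2

items = [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2),
         (0.9, 1.0, 0.2), (1.1, -0.5, 0.2)]   # hypothetical a, b, c values
theta_hat = 0.0                                # score estimate from the previous sketch

test_info = sum(item_info_3pl(theta_hat, a, b, c) for a, b, c in items)
sem = 1.0 / np.sqrt(test_info)
low, high = theta_hat - 1.96 * sem, theta_hat + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% CI = [{low:.2f}, {high:.2f}]")
```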
  72. Scoring - SEM  If a student gives aberrant responses (cheating, not paying attention, etc.) they will have a larger SEM  This is not enough to accuse of cheating (they could have just dozed off), but it can provide useful information for research
  73. Scoring - SEM  SEM CI is also used to make decisions ◦ Pass if 2 SEMs above a cutoff
  74. Details on IRT Scores  Student scores are on the θ scale, which is analogous to the standard normal z scale – same interpretations!  There are four methods of scoring ◦ Maximum Likelihood (MLE) ◦ Bayesian Modal (or MAP, for maximum a posteriori) ◦ Bayesian EAP (expectation a posteriori) ◦ Weighted MLE (less common)
  75. Maximum likelihood  Take the likelihood function “as is” and find the highest point
  76. Maximum likelihood  Problem: all-incorrect or all-correct response strings (the LF has no finite maximum)
  77. Bayesian modal  Addresses that problem by always multiplying the LF by a bell-shaped curve, which forces it to have a maximum somewhere  Still find the highest point
  78. Bayesian EAP  Argues that the curve is not symmetrical, and we should not ignore everything except the maximum  So it takes the “average” of the curve by splitting it into many slices and finding the weighted average  The slices are called quadrature points or nodes
  79. Bayesian EAP  Example: see 3PL tail
  80. Bayesian EAP  Simple EAP overlay: θ ≈ -0.50
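A sketch of both Bayesian approaches on a quadrature grid (standard normal prior and hypothetical items; real programs differ in their grids and priors):

```python
import numpy as np
from scipy.stats import norm

def irf_3pl(theta, a, b, c, D=1.702):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

items = np.array([[1.0, -1.0, 0.2], [0.8, 0.0, 0.2], [1.2, 0.5, 0.2]])  # a, b, c
responses = np.array([1, 0, 0])

quad = np.linspace(-4, 4, 81)                      # quadrature points (nodes)
P = irf_3pl(quad[:, None], items[:, 0], items[:, 1], items[:, 2])
likelihood = np.prod(np.where(responses == 1, P, 1.0 - P), axis=1)

posterior = likelihood * norm.pdf(quad)            # multiply the LF by a N(0, 1) prior
theta_map = quad[np.argmax(posterior)]             # Bayesian modal (MAP): highest point
theta_eap = np.sum(quad * posterior) / np.sum(posterior)   # EAP: weighted average
print(f"MAP = {theta_map:.2f}, EAP = {theta_eap:.2f}")
```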
  81. Bayesian  Why Bayesian? ◦ Nonmixed response vectors ◦ Asymmetric LF  Why not Bayesian? ◦ Biased inward – if you find the θ estimates of 1000 students, the SD would be smaller with the Bayesian estimates, maybe 0.95
  82. Newton-Raphson  Most IRT software actually uses a somewhat different approach to MLE and Bayesian Modal  The straightforward way is to calculate the value of the LF at each point on θ, within reason  For example, -4 to 4 in steps of 0.001  That’s 8,000 calculations! Too much for 1970s computers…
  83. Newton-Raphson  Newton-Raphson is a shortcut method that searches the curve iteratively for its maximum  Why? Same 0.001 level of accuracy in only 5 to 20 iterations  Across thousands of students, that is a huge amount of calculations saved  But certain issues (local maxima or minima)… maybe time to abandon?
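A sketch of the idea (numerical derivatives of the log-likelihood and hypothetical items; production software uses analytic derivatives and extra safeguards against the issues just mentioned):

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.702):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def loglik(theta, items, u):
    P = irf_3pl(theta, items[:, 0], items[:, 1], items[:, 2])
    return np.sum(u * np.log(P) + (1 - u) * np.log(1 - P))

def newton_raphson(items, u, theta=0.0, tol=1e-3, max_iter=20, h=1e-3):
    """Climb the log-likelihood iteratively using numerical 1st/2nd derivatives."""
    for _ in range(max_iter):
        d1 = (loglik(theta + h, items, u) - loglik(theta - h, items, u)) / (2 * h)
        d2 = (loglik(theta + h, items, u) - 2 * loglik(theta, items, u)
              + loglik(theta - h, items, u)) / h ** 2
        step = d1 / d2
        theta -= step                    # Newton step toward the maximum
        if abs(step) < tol:
            break
    return theta

items = np.array([[1.0, -1.0, 0.2], [0.8, 0.0, 0.2], [1.2, 0.5, 0.2],
                  [0.9, 1.0, 0.2], [1.1, -0.5, 0.2]])   # hypothetical a, b, c
u = np.array([1, 1, 0, 0, 1])
print(f"Newton-Raphson estimate: {newton_raphson(items, u):.3f}")
```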
  84. Examples  See IRT Scoring and Graphing Tool
  85. Part 4 Item parameter estimation How do we get a, b, and c?
  86. The estimation problem  Estimating student θ given a set of known item parameters is easy because we have something established  But what about the first time a test is given?  All items are new, and there are no established student scores
  87. The estimation problem  Which came first, the chicken or the egg?  Since we don’t know, we go back and forth, trying one and then the other ◦ Fix “temporary” z-scores ◦ Estimate item parameters ◦ Fix the new item parameters ◦ Estimate scores ◦ Do it again until we’re satisfied
  88. Calibration algorithms  There are two calibration algorithms ◦ Joint maximum likelihood (JML) – older ◦ Marginal maximum likelihood (MML) – newer, and works better with smaller samples… the standard ◦ Also conditional maximum likelihood, but it only works with 1PL, so rarer ◦ New in research, but not in standard software: Markov chain monte carlo
  89. Calibration algorithms  The term maximum likelihood is used here because we are maximizing the likelihood of the entire data set, over all items i and persons j:

      L(X \mid b, \theta) = \prod_{j} \prod_{i} P_{ij}^{\,x_{ij}} \, Q_{ij}^{\,1 - x_{ij}}

  X is the data set of responses x_{ij}  b is the set of item parameters b_i  θ is the set of examinee θ_j’s
  90. Calibration algorithms  This means we want to find the b and θ that make that number the largest  So we fix θ, find a good b, use it to score students and find a new θ, find a better b, etc… ◦ Marginal ML uses marginal distributions rather than exact points, hence it being faster and working better with smaller samples of people/items
  91. Calibration algorithms  Note: rather than examine the LF (which gets incredibly small), software examines -2*ln(LF)  IRT software tracks these iterations because they provide information on model fit  See output
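A deliberately simplified sketch of that back-and-forth (a joint-maximum-likelihood-style loop for the 1PL on simulated data, with grid search standing in for the Newton-type updates real calibration software uses, and -2*ln(LF) tracked at each iteration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1.702

def p1pl(theta, b):
    return 1.0 / (1.0 + np.exp(-D * (theta - b)))

# Simulated data set: 500 people x 20 items generated from a 1PL, just to calibrate
true_theta = rng.normal(size=500)
true_b = np.linspace(-2, 2, 20)
X = (rng.random((500, 20)) < p1pl(true_theta[:, None], true_b[None, :])).astype(int)

grid = np.linspace(-4, 4, 161)
theta = np.zeros(500)                                  # fix "temporary" scores
b = np.zeros(20)

for iteration in range(5):
    # Estimate item difficulties given the current thetas
    P = p1pl(theta[:, None], grid[None, :])            # people x candidate b values
    ll_items = X.T @ np.log(P) + (1 - X).T @ np.log(1 - P)
    b = grid[np.argmax(ll_items, axis=1)]
    # Estimate thetas given the new item difficulties
    P = p1pl(grid[:, None], b[None, :])                # candidate thetas x items
    ll_people = X @ np.log(P).T + (1 - X) @ np.log(1 - P).T
    theta = grid[np.argmax(ll_people, axis=1)]
    # Track -2*ln(LF) for the whole data set, as calibration software does
    P = p1pl(theta[:, None], b[None, :])
    deviance = -2 * np.sum(X * np.log(P) + (1 - X) * np.log(1 - P))
    print(f"iteration {iteration + 1}: -2lnL = {deviance:.1f}")
```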
  92. Part 4 (cont.) Assumptions of IRT: Model-data fit
  93. Checking fit  One assumption of IRT (#2) is that our data actually follow the model!  This is true at both the item and the test level  It is also true for examinees: they should be getting items wrong that are above their θ and getting items right that are below their θ
  94. Model-data fit  Whenever fitting any mathematical model to empirical data (not just IRT), it is important to assess fit  Fit refers to whether the model adequately represents the data – or, put the other way, how far the data are from the model
  95. Model-data fit  There are two types of fit important in IRT ◦ Item (and test) - compares observed data to the IRF ◦ Person – evaluates whether individual students are responding according to the model  Easy items correct, hard items incorrect
  96. Model-data fit  Remember the 10-group empirical IRF that I drew? This is great!
  97. Model-data fit  You’re more likely to see something like this:
  98. Model-data fit  Or even worse…
  99. Model-data fit  Note that if we drew an IRF in each of those graphs, it would be about the same  But it is obviously less appropriate in Graph #3 (“even worse”)  Fit analyses provide a way of quantifying this
  100. Item fit  The most basic approach is to subtract the observed frequency correct from the expected value for each slice (g) of θ  This is then summarized in a chi-square statistic  Bigger = worse fit
  101. Item fit  Graphical depiction:
  102. Item fit  Better fit
  103. Item fit  The slices are called quadrature points  Also used for item parameter estimation  The number of slices for chi-square need not be the same as for estimation, but it helps interpretation
  104. Item fit  Chi-square is oversensitive to sample size  A better way is to compute standardized residuals  Divide the chi-square by its df = G - m, where G is the number of groups and m is the number of item parameters  This is more interpretable because of the well-known scale  0 is OK; examine items > 2
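A sketch of that kind of item-fit check (simulated examinees, θ slices built from quantiles, and the df convention from the slide; real programs such as Xcalibre use their own grouping and corrections):

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.702):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def item_fit(theta_hat, u, a, b, c, n_groups=10):
    """Chi-square item fit: observed vs. expected proportion correct in theta slices."""
    edges = np.quantile(theta_hat, np.linspace(0, 1, n_groups + 1))
    group = np.digitize(theta_hat, edges[1:-1])        # slice index 0..n_groups-1
    chi_sq = 0.0
    for g in range(n_groups):
        in_g = group == g
        n_g = in_g.sum()
        observed = u[in_g].mean()                             # observed proportion correct
        expected = irf_3pl(theta_hat[in_g], a, b, c).mean()   # model-implied value
        chi_sq += n_g * (observed - expected) ** 2 / (expected * (1 - expected))
    df = n_groups - 3                                  # m = 3 parameters for the 3PL
    return chi_sq, chi_sq / df

rng = np.random.default_rng(1)
theta_hat = rng.normal(size=1000)                      # hypothetical score estimates
a, b, c = 1.0, 0.2, 0.2
u = (rng.random(1000) < irf_3pl(theta_hat, a, b, c)).astype(int)   # item that fits
print(item_fit(theta_hat, u, a, b, c))                 # chi-square and chi-square/df
```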
  105. Item fit  For broad analysis of fit, use quantile plots (Xcalibre, Iteman, or Lertap) ◦ 3 to 7 groups ◦ Can find hidden issues (My example: social desirability in Likert #2)  See Xcalibre output ◦ Fit statistics ◦ Fit graphs (many more groups, and IRF)
  106. Person fit  Is an examinee responding oddly?  Most basic measure: take the log of the LF at the maximum (the θ estimate)  A higher number means we are more sure of the estimate  But this is dependent on the level of θ, so we need it standardized: lz

      l_o = \sum_{i=1}^{n} \ln\!\left( \hat{P}_i^{\,u_i} \, \hat{Q}_i^{\,1 - u_i} \right)
  107. Person fit  lz is like a z-score for fit: z = (x - μ)/s  Less than -2 means bad fit

      l_z = \frac{l_o - E(l_o)}{\sqrt{\mathrm{Var}(l_o)}}

      E(l_o) = \sum_{i=1}^{n} \left[ \hat{P}_i \ln \hat{P}_i + (1 - \hat{P}_i) \ln(1 - \hat{P}_i) \right]

      \mathrm{Var}(l_o) = \sum_{i=1}^{n} \hat{P}_i (1 - \hat{P}_i) \left[ \ln \frac{\hat{P}_i}{1 - \hat{P}_i} \right]^{2}
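A sketch implementing those formulas directly (hypothetical probabilities at a person's θ estimate, with one aberrant and one well-behaved response string for comparison):

```python
import numpy as np

def person_fit_lz(P_hat, u):
    """Standardized log-likelihood person-fit statistic l_z.

    P_hat: model probabilities P_i(theta_hat) for the person's items;
    u: the person's 0/1 responses. Values well below -2 suggest misfit.
    """
    l_o = np.sum(u * np.log(P_hat) + (1 - u) * np.log(1 - P_hat))
    expected = np.sum(P_hat * np.log(P_hat) + (1 - P_hat) * np.log(1 - P_hat))
    variance = np.sum(P_hat * (1 - P_hat) * np.log(P_hat / (1 - P_hat)) ** 2)
    return (l_o - expected) / np.sqrt(variance)

# Hypothetical probabilities at theta-hat, easiest item first
P_hat = np.array([0.95, 0.90, 0.75, 0.50, 0.30, 0.15])
u_aberrant = np.array([0, 0, 0, 1, 1, 1])   # misses easy items, passes hard ones
u_typical = np.array([1, 1, 1, 1, 0, 0])    # the pattern the model expects
print(person_fit_lz(P_hat, u_aberrant))     # strongly negative: flagged
print(person_fit_lz(P_hat, u_typical))      # near zero: fine
```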
  108. Person fit  lz is sensitive to the distribution of item difficulties  Works best when there is a range of difficulty  That is, if there are no items for high-ability examinees, none of them will have a good estimate!  Best to evaluate groups, not individuals
  109. How is fit useful?  Throw out items?  Throw out people?  Change model used?  Bad fit can flag other possible issues ◦ Speededness: fit (and N) gets worse at end of test ◦ Multidimensionality: certain areas
  110. How is fit useful?  Note that this fits in with the estimation process  IRT calibration is not “one-click”  Review results, then make adjustments ◦ Remove items/people ◦ Modify par distributions ◦ Modify quadrature points ◦ Etc.
  111. Summary  That was a basic intro to the rationale of IRT  Next we will start talking about some applications and uses  We will also examine IRT software and output