Item analysis



Table of Contents

  • Major Uses of Item Analysis
  • Item Analysis Reports
  • Item Analysis Response Patterns
  • Basic Item Analysis Statistics
  • Interpretation of Basic Statistics
  • Other Item Statistics
  • Summary Data
  • Report Options
  • Item Analysis Guidelines

Major Uses of Item Analysis

Item analysis can be a powerful technique available to instructors for the guidance and improvement of instruction. For this to be so, the items to be analyzed must be valid measures of instructional objectives. Further, the items must be diagnostic; that is, knowledge of which incorrect options students select must be a clue to the nature of the misunderstanding, and thus prescriptive of appropriate remediation. In addition, instructors who construct their own examinations may greatly improve the effectiveness of test items and the validity of test scores if they select and rewrite their items on the basis of item performance data. Such data are available to instructors who have their examination answer sheets scored at the Computer Laboratory Scoring Office.

Item Analysis Reports

As the answer sheets are scored, records are written which contain each student's score and his or her response to each item on the test. These records are then processed and an item analysis report file is generated. An instructor may obtain test score distributions and a list of students' scores in alphabetic order, in student number order, in percentile rank order, and/or in order of percentage of total points. Instructors receive their item analysis reports as e-mail attachments. The item analysis report is contained in the file IRPT####.RPT, where the four digits indicate the instructor's GRADER III file. A sample of an individual long-form item analysis listing is shown below.

Item 10 of 125. The correct option is 5.

Item Response Pattern

                 1          2          3          4          5          Omit      Error     Total
  Upper 27%      2 (7%)     8 (27%)    0 (0%)     1 (3%)     19 (63%)   0 (0%)    0 (0%)    30 (100%)
  Middle 46%     3 (6%)     20 (38%)   3 (6%)     3 (6%)     23 (44%)   0 (0%)    0 (0%)    52 (100%)
  Lower 27%      6 (20%)    5 (17%)    8 (27%)    2 (7%)     9 (30%)    0 (0%)    0 (0%)    30 (101%)
  Total          11 (10%)   33 (29%)   11 (10%)   6 (5%)     51 (46%)   0 (0%)    0 (0%)    112 (100%)

Item Analysis Response Patterns

Each item is identified by number and the correct option is indicated. The group of students taking the test is divided into upper, middle and lower groups on the basis of students' scores on the test. This division is essential if information is to be provided concerning the operation of distracters (incorrect options) and to compute an easily interpretable index of discrimination. It has long been accepted that optimal item discrimination is obtained when the upper and lower groups each contain twenty-seven percent of the total group.

The number of students who selected each option or omitted the item is shown for each of the upper, middle, lower and total groups. The number of students who marked more than one option to the item is indicated under the "error" heading. The percentage of each group who selected each of the options, omitted the item, or erred is also listed. Note that the total percentage for each group may be other than 100%, since the percentages are rounded to the nearest whole number before totaling.

The sample item listed above appears to be performing well. About two-thirds of the upper group but only one-third of the lower group answered the item correctly. Ideally, the students who answered the item incorrectly should select each incorrect response in roughly equal proportions, rather than concentrating on a single incorrect option. Option two seems to be the most attractive incorrect option, especially to the upper and middle groups. It is most undesirable for a greater proportion of the upper group than of the lower group to select an incorrect option; the item writer should examine such an option for possible ambiguity. For the sample item above, option four was selected by only five percent of the total group. An attempt might be made to make this option more attractive.
Item analysis provides the item writer with a record of student reaction to items. It gives little information about the appropriateness of an item for a course of instruction. The appropriateness, or content validity, of an item must be determined by comparing the content of the item with the instructional objectives.

Basic Item Analysis Statistics

A number of item statistics are reported which aid in evaluating the effectiveness of an item. The first of these is the index of difficulty, which is the proportion of the total group who got the item wrong. Thus a high index indicates a difficult item and a low index indicates an easy item. Some item analysts prefer an index of difficulty which is the proportion of the total group who got an item right; this index may be obtained by marking the PROPORTION RIGHT option on the item analysis header sheet. Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item analysis printout. For classroom achievement tests, most test constructors desire items with indices of difficulty no lower than 20 nor higher than 80, with an average index of difficulty from 30 or 40 to a maximum of 60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group who got an item right and the proportion of the lower group who got the item right. This index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item with an index of difficulty of 50, that is, when 100% of the upper group and none of the lower group answer the item correctly. For items of less than or greater than 50 difficulty, the index of discrimination has a maximum value of less than 100. The Interpreting the Index of Discrimination document contains a more detailed discussion of the index of discrimination.
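The two basic indices described above can be sketched in a few lines of Python. This is an illustrative sketch, not the Scoring Office's actual program: the function name, the naive tie handling when ranking students, and the rounding are my own assumptions; the 27% group fraction follows the text.

```python
def item_indices(scores, item_correct, group_frac=0.27):
    """Compute the index of difficulty and index of discrimination for one item.

    scores       -- total test score for each student
    item_correct -- 1 if that student got this item right, else 0
    group_frac   -- fraction of students in each of the upper/lower groups (27% here)
    """
    # Rank students by total score, highest first (ties broken arbitrarily).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_group = max(1, round(len(scores) * group_frac))
    upper = [item_correct[i] for i in order[:n_group]]
    lower = [item_correct[i] for i in order[-n_group:]]

    # Difficulty here = percent of the TOTAL group answering wrong, as in the report.
    difficulty = 100 * (1 - sum(item_correct) / len(item_correct))
    # Discrimination = percent right in upper group minus percent right in lower group.
    discrimination = 100 * (sum(upper) / n_group - sum(lower) / n_group)
    return round(difficulty), round(discrimination)
```

For an item of difficulty 50 answered correctly by the whole upper group and none of the lower group, this returns the maximum discrimination of 100, matching the discussion above.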
Interpretation of Basic Statistics

To aid in interpreting the index of discrimination, the maximum discrimination value and the discriminating efficiency are given for each item. The maximum discrimination is the highest possible index of discrimination for an item at a given level of difficulty. For example, an item answered correctly by 60% of the group would have an index of difficulty of 40 and a maximum discrimination of 80. This would occur when 100% of the upper group and 20% of the lower group answered the item correctly. The discriminating efficiency is the index of discrimination divided by the maximum discrimination. For example, an item with an index of discrimination of 40 and a maximum discrimination of 50 would have a discriminating efficiency of 80. This may be interpreted to mean that the item is discriminating at 80% of the potential of an item of its difficulty. For a more detailed discussion of the maximum discrimination and discriminating efficiency concepts, see the Interpreting the Index of Discrimination document.

Other Item Statistics

Some test analysts may desire more complex item statistics. Two correlations which are commonly used as indicators of item discrimination are shown on the item analysis report. The first is the biserial correlation, which is the correlation between a student's performance on an item (right or wrong) and his or her total score on the test. This correlation assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right/wrong dichotomy. The biserial correlation has the characteristic, disconcerting to some, of having maximum values greater than unity. There is no exact test for the statistical significance of the biserial correlation coefficient. The point biserial correlation is also a correlation between student performance on an item (right or wrong) and test score.
It assumes that the test score distribution is normal and that the division on item performance is a natural dichotomy. The possible range of values for the point biserial correlation is +1 to -1. The Student's t test for the statistical significance of the point biserial correlation is given on the item analysis report. Enter a table of Student's t values with N - 2 degrees of freedom at the desired percentile point; N, in this case, is the total number of students appearing in the item analysis.

The mean scores for students who got an item right and for those who got it wrong are also shown. These values are used in computing the biserial and point biserial coefficients of correlation and are not generally used as item analysis statistics.

Generally, item statistics will be somewhat unstable for small groups of students. Perhaps fifty students might be considered a minimum number if item statistics are to be stable. Note that for a group of fifty students, the upper and lower groups would contain only thirteen students each. The stability of item analysis results will improve as the group of students is increased to one hundred or more. An item analysis for very small groups must not be considered a stable indication of the performance of a set of items.

Summary Data

The item analysis data are summarized on the last page of the item analysis report. The distribution of item difficulty indices is a tabulation showing the number and percentage of items whose difficulties are in each of ten categories, ranging from a very easy category (00-10) to a very difficult category (91-100). The distribution of discrimination indices is tabulated in the same manner, except that a category is included for negatively discriminating items.
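The maximum discrimination and discriminating efficiency described under "Interpretation of Basic Statistics" can also be sketched directly. The closed-form rule below is inferred from the worked example in the text (upper and lower halves), so treat it as an assumption rather than the report's exact algorithm; indices are on the report's 0-100 scale.

```python
def max_discrimination(difficulty):
    """Highest possible index of discrimination at a given index of difficulty,
    where difficulty = percent of the total group answering wrong.
    Peaks at 100 when difficulty is 50, as stated in the text."""
    return 2 * min(difficulty, 100 - difficulty)

def discriminating_efficiency(discrimination, difficulty):
    """Index of discrimination as a percentage of the maximum attainable
    for an item of this difficulty."""
    return round(100 * discrimination / max_discrimination(difficulty))
```

This reproduces the text's example: an item answered correctly by 60% of the group (difficulty 40) has a maximum discrimination of 80, and an item with discrimination 40 against a maximum of 50 has a discriminating efficiency of 80.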
The mean item difficulty is determined by adding all of the item difficulty indices and dividing the total by the number of items. The mean item discrimination is determined in a similar manner.

Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If the test is speeded, that is, if some of the students did not have time to consider each test item, the reliability estimate may be spuriously high. The final test statistic is the standard error of measurement. This statistic is a common device for interpreting the absolute accuracy of the test scores. The size of the standard error of measurement depends on the standard deviation of the test scores as well as on the estimated reliability of the test.

Occasionally, a test writer may wish to omit certain items from the analysis although these items were included in the test as it was administered. Such items may be omitted by leaving them blank on the test key. The response patterns for omitted items will be shown but the keyed options will be listed as OMIT. The statistics for these items will be omitted from the Summary Data.

Report Options

A number of report options are available for item analysis data. The long-form item analysis report contains three items per page. A standard-form item analysis report is available in which the data on each item are summarized on one line. A sample report is shown below.

ITEM ANALYSIS   Test 4482   125 Items   112 Students
Percentages: Upper 27% - Middle - Lower 27%

  Item  Key   1          2           3          4           5        Omit     Error    Diff   Disc
  1     4     7-23-57    0- 4- 7     28- 8-36   64-62- 0    0-0-0    0-0-0    0-0-0    54     64
  2     2     7-12- 7    64-42-29    14- 4-21   14-42-36    0-0-0    0-0-0    0-0-0    56     35

The standard form shows the item number, the key (number of the correct option), the percentage of the upper, middle, and lower groups who selected each option, omitted the item, or erred, the index of difficulty, and the index of discrimination.
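The two test-level summary statistics just mentioned, KR-20 reliability and the standard error of measurement, can be computed from a 0/1 item-score matrix. This is a sketch using the standard textbook formulas (population-variance convention assumed), not the Scoring Office's code:

```python
import math

def kr20(item_matrix):
    """Kuder-Richardson formula 20 reliability from a 0/1 item-score matrix
    (rows = students, columns = items)."""
    k = len(item_matrix[0])                     # number of items
    totals = [sum(row) for row in item_matrix]  # each student's total score
    n = len(totals)
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n  # population variance of totals
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion right on item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)

def standard_error_of_measurement(item_matrix):
    """SEM = SD * sqrt(1 - reliability): the absolute accuracy of the scores.
    It grows with the score SD and shrinks as reliability rises, as in the text."""
    totals = [sum(row) for row in item_matrix]
    n = len(totals)
    mean = sum(totals) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / n)
    return sd * math.sqrt(1 - kr20(item_matrix))
```

When every item splits the group identically, KR-20 reaches 1.0 and the SEM collapses to 0; uncorrelated items drive KR-20 toward 0.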
For example, in item 1 above, option 4 was the correct answer and it was selected by 64% of the upper group, 62% of the middle group and 0% of the lower group. The index of difficulty, based on the total group, was 54 and the index of discrimination was 64.

Item Analysis Guidelines

Item analysis is a completely futile process unless the results help instructors improve their classroom practices and item writers improve their tests. Let us suggest a number of points of departure in the application of item analysis data.

1. Item analysis gives necessary but not sufficient information concerning the appropriateness of an item as a measure of intended outcomes of instruction. An item may perform beautifully with respect to item analysis statistics and yet be quite irrelevant to the instruction whose results it was intended to measure. A most common error is to teach for behavioral objectives such as analysis of data or situations, ability to discover trends, ability to infer meaning, etc., and then to construct an objective test measuring mainly recognition of facts. Clearly, the objectives of instruction must be kept in mind when selecting test items.

2. An item must be of appropriate difficulty for the students to whom it is administered. If possible, items should have indices of difficulty no less than 20 and no greater than 80. It is desirable to have most items in the 30 to 50 range of difficulty. Very hard or very easy items contribute little to the discriminating power of a test.

3. An item should discriminate between upper and lower groups. These groups are usually based on total test score but they could be based on some other criterion such as grade-point average, scores on other tests, etc. Sometimes an item will discriminate negatively, that is, a larger proportion of the lower group than of the upper group selected the correct option. This often means that the students in the upper group were misled by an ambiguity that the students in the lower group, and the item writer, failed to discover. Such an item should be revised or discarded.

4. All of the incorrect options, or distracters, should actually be distracting. Preferably, each distracter should be selected by a greater proportion of the lower group than of the upper group. If, in a five-option multiple-choice item, only one distracter is effective, the item is, for all practical purposes, a two-option item. Existence of five options does not automatically guarantee that the item will operate as a five-choice item.

Item analysis is a general term that refers to the specific methods used in education to evaluate test items, typically for the purpose of test construction and revision. Regarded as one of the most important aspects of test construction and increasingly receiving attention, it is an approach incorporated into item response theory (IRT), which serves as an alternative to classical measurement theory (CMT) or classical test theory (CTT). Classical measurement theory considers a score to be the direct result of a person's true score plus error. It is this error that is of interest, as previous measurement theories have been unable to specify its source.
However, item response theory uses item analysis to differentiate between types of error in order to gain a clearer understanding of any existing deficiencies. Particular attention is given to individual test items, item characteristics, probability of answering items correctly, overall ability of the test taker, and degrees or levels of knowledge being assessed.

THE PURPOSE OF ITEM ANALYSIS

There must be a match between what is taught and what is assessed. However, there must also be an effort to test for more complex levels of understanding, with care taken to avoid over-sampling items that assess only basic levels of knowledge. Tests that are too difficult (and have an insufficient floor) tend to lead to frustration and deflated scores, whereas tests that are too easy (and have an insufficient ceiling) facilitate a decline in motivation and lead to inflated scores. Tests can be improved by maintaining and developing a pool of valid items, covering a reasonable span of difficulty levels, from which future tests can be drawn.

Item analysis helps improve test items and identify unfair or biased items. Results should be used to refine test item wording. In addition, closer examination of items will also reveal which questions were most difficult, perhaps indicating a concept that needs to be taught more thoroughly. If a particular distracter (that is, an incorrect answer choice) is the most often chosen answer, and especially if that distracter positively correlates with a high total score, the item must be examined more closely for correctness. This situation also provides an opportunity to identify and examine common misconceptions among students about a particular concept.

In general, once test items have been created, their value can be systematically assessed using several methods representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item characteristic curve.
Difficulty is assessed by examining the number of persons correctly endorsing the answer. Discrimination can be examined by comparing the number of persons getting a particular item correct with the total test score. Finally, the item characteristic curve can be used to plot the likelihood of answering correctly against the level of success on the test.

ITEM DIFFICULTY

In test construction, item difficulty is determined by the number of people who answer a particular test item correctly. For example, if the first question on a test was answered correctly by 76% of the class, then the difficulty level (p, or percentage passing) for that question is p = .76. If the second question on a test was answered correctly by only 48% of the class, then the difficulty level for that question is p = .48. The higher the percentage of people who answer correctly, the easier the item, so a difficulty level of .48 indicates that question two was more difficult than question one, which had a difficulty level of .76.

Many educators find themselves wondering how difficult a good test item should be. Several things must be taken into consideration in order to determine an appropriate difficulty level. The first task of any test maker should be to determine the probability of answering an item correctly by chance alone, also referred to as guessing or luck. For example, a true-false item, because it has only two choices, could be answered correctly by chance half of the time. Therefore, a true-false item with a demonstrated difficulty level of only p = .50 would not be a good test item, because that level of success could be achieved through guessing alone and would not be an actual indication of knowledge or ability level. Similarly, a multiple-choice item with five alternatives could be answered correctly by chance 20% of the time. Therefore, an item difficulty greater than .20 would be necessary in order to discriminate between respondents' ability to guess correctly and respondents' level of knowledge.

Desirable difficulty levels usually can be estimated as halfway between 100 percent and the percentage of success expected by guessing. So the desirable difficulty level for a true-false item, for example, should be around p = .75, which is halfway between 100% and 50% correct.

In most instances, it is desirable for a test to contain items of various difficulty levels in order to distinguish between students who are not prepared at all, students who are fairly prepared, and students who are well prepared.
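The chance-level reasoning above reduces to simple arithmetic. As a sketch (function names are my own), the "halfway between 100% and chance" rule of thumb looks like this:

```python
def chance_level(n_options):
    """Probability of answering correctly by blind guessing."""
    return 1.0 / n_options

def desirable_difficulty(n_options):
    """Rule of thumb from the text: a desirable p value sits halfway
    between 100% correct and the success rate expected from guessing alone."""
    c = chance_level(n_options)
    return c + (1.0 - c) / 2
```

For a true-false item this gives p = .75, matching the text; for a five-option multiple-choice item it gives p = .60, halfway between 100% and 20%.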
In other words, educators do not want the same level of success for those students who did not study as for those who studied a fair amount, or for those who studied a fair amount and those who studied exceptionally hard. Therefore, it is necessary for a test to be composed of items of varying levels of difficulty. As a general rule for norm-referenced tests, items in the difficulty range of .30 to .70 yield important differences between individuals' levels of knowledge, ability, and preparedness. There are a few exceptions to this, however, with regard to the purpose of the test and the characteristics of the test takers. For instance, if the test is to help determine entrance into graduate school, the items should be more difficult to be able to make finer distinctions between test takers.

For a criterion-referenced test, most of the item difficulties should be clustered around the criterion cut-off score or higher. For example, if a passing score is 70%, the vast majority of items should have percentage passing values of p = .60 or higher, with a number of items in the p > .90 range to enhance motivation and test for mastery of certain essential concepts.

[Figure 1. Illustration by GGS Information Services. Cengage Learning, Gale.]

DISCRIMINATION INDEX

According to Wilson (2005), item difficulty is the most essential component of item analysis. However, it is not the only way to evaluate test items. Discrimination goes beyond determining the proportion of people who answer correctly and looks more specifically at who answers correctly. In other words, item discrimination determines whether those who did well on the entire test did well on a particular item. An item should in fact be able to discriminate between upper and lower scoring groups. Membership in these groups is usually determined based on total test score, and it is expected that those scoring higher on the overall test will also be more likely to endorse the correct response on a particular item.
Sometimes an item will discriminate negatively, that is, a larger proportion of the lower group select the correct response as compared to those in the higher scoring group. Such an item should be revised or discarded.

One way to determine an item's power to discriminate is to compare those who have done very well with those who have done very poorly, known as the extreme group method. First, identify the students who scored in the top one-third as well as those in the bottom one-third of the class. Next, calculate the proportion of each group that answered a particular test item correctly (i.e., percentage passing for the high and low groups on each item). Finally, subtract the p of the bottom performing group from the p of the top performing group to yield an item discrimination index (D). Item discriminations of D = .50 or higher are considered excellent. D = 0 means the item has no discrimination ability, while D = 1.00 means the item has perfect discrimination ability.

In Figure 1, it can be seen that Item 1 discriminates well, with those in the top performing group obtaining the correct response far more often (p = .92) than those in the low performing group (p = .40), thus resulting in an index of .52 (i.e., .92 - .40 = .52). Next, Item 2 is not difficult enough, with a discriminability index of only .04, meaning this particular item was not useful in discriminating between the high and low scoring individuals. Finally, Item 3 is in need of revision or discarding, as it discriminates negatively, meaning low performing group members actually obtained the correct keyed answer more often than high performing group members.

[Figure 2. Illustration by GGS Information Services. Cengage Learning, Gale.]

Another way to determine the discriminability of an item is to determine the correlation coefficient between performance on an item and performance on the test as a whole, or the tendency of students selecting the correct answer to have high overall scores. This coefficient is reported as the item discrimination coefficient, or the point-biserial correlation between item score (usually scored right or wrong) and total test score. This coefficient should be positive, indicating that students answering correctly tend to have higher overall scores or that students answering incorrectly tend to have lower overall scores. Also, the higher the magnitude, the better the item discriminates. The point-biserial correlation can be computed with the procedures outlined in Figure 2, and it is evaluated similarly to the extreme group discrimination index: if the resulting value is negative or low, the item should be revised or discarded. The closer the value is to 1.0, the stronger the item's discrimination power; the closer the value is to 0, the weaker the power. Items that are very easy and answered correctly by the majority of respondents will have poor point-biserial correlations.

[Figure 3. Illustration by GGS Information Services. Cengage Learning, Gale.]
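Since Figure 2 is not reproduced here, the following sketch uses the standard point-biserial formula (not necessarily the exact procedure shown in the figure), together with the Student's t significance test mentioned earlier in the document; function names are my own.

```python
import math

def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between item score (0/1) and total test score."""
    n = len(total_scores)
    p = sum(item_correct) / n                     # proportion answering correctly
    q = 1 - p
    mean_all = sum(total_scores) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    mean_right = (sum(t for c, t in zip(item_correct, total_scores) if c)
                  / sum(item_correct))            # mean score of those who got it right
    return (mean_right - mean_all) / sd * math.sqrt(p / q)

def t_statistic(r, n):
    """Student's t for testing the significance of a point-biserial r,
    evaluated with n - 2 degrees of freedom (n = number of students)."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

A positive value means students answering correctly tend to have higher overall scores; the result agrees with an ordinary Pearson correlation computed on the same 0/1 and total-score data.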
ITEM CHARACTERISTIC CURVE

A third parameter used to conduct item analysis is known as the item characteristic curve (ICC). This is a graphical or pictorial depiction of the characteristics of a particular item or, taken collectively, of the entire test. In the item characteristic curve, the total test score is represented on the horizontal axis, and the proportion of test takers passing the item within that range of test scores is scaled along the vertical axis.
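An empirical ICC of the kind just described can be tabulated without any plotting library by banding students on total score and computing the proportion passing the item in each band. This is a minimal sketch; the band count and names are my own assumptions:

```python
def icc_points(total_scores, item_correct, n_bands=5):
    """Empirical item characteristic curve: proportion passing the item
    within successive bands of total test score, ordered low to high."""
    paired = sorted(zip(total_scores, item_correct))  # sort students by total score
    band_size = max(1, len(paired) // n_bands)
    points = []
    for start in range(0, len(paired), band_size):
        band = paired[start:start + band_size]
        points.append(sum(c for _, c in band) / len(band))
    return points
```

A good item (Line C in Figure 3) yields a sequence that rises steadily; a flat sequence corresponds to Line A, and a falling one signals negative discrimination.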
For Figure 3, three separate item characteristic curves are shown. Line A is considered a flat curve and indicates that test takers at all score levels were equally likely to get the item correct; this item was therefore not a useful discriminating item. Line B demonstrates a troublesome item, as it gradually rises and then drops for those scoring highest on the overall test. Though this is unusual, it can sometimes result from those who studied most having ruled out the answer that was keyed as correct. Finally, Line C shows the item characteristic curve for a good test item. The gradual and consistent positive slope shows that the proportion of people passing the item gradually increases as test scores increase. Though it is not depicted here, if an ICC were seen in the shape of a backward S, negative item discrimination would be evident, meaning that those who scored lowest were most likely to endorse the correct response on the item.

Eight Simple Steps to Item Analysis

1. Score each answer sheet, and write the score total on the corner
   o you obviously have to do this anyway
2. Sort the pile into rank order from top to bottom score (1 minute, 30 seconds tops)
3. If it's a normal class of 30 students, divide the class in half
   o same number in top and bottom group
   o toss the middle paper if there's an odd number (put it aside)
4. Take the "top" pile and count the number of students who responded to each alternative
   o the fast way is simply to sort the sheets into piles for "A", "B", "C", "D" (or true/false, or type of error for short-answer and fill-in-the-blank items), OR set it up on a spreadsheet if you're familiar with computers

   ITEM ANALYSIS FORM - TEACHER CONSTRUCTED TESTS (CLASS SIZE = 30)

   ITEM        UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
   1.  A         0
      *B         4
       C         1
       D         1
       O
   *=Keyed Answer

   o repeat for the lower group:

   ITEM        UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
   1.  A         0
      *B         4       2
       C         1
       D         1
       O
   *=Keyed Answer

   o this is the time-consuming part, but it's not that bad; you can do it while watching TV, because you're just sorting piles
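The hand tally in steps 3-4 can be mirrored in a few lines of Python. This is my own sketch of the form, not part of the original notes; the data layout (a dict of student id to chosen option) and the function name are assumptions:

```python
from collections import Counter

def tally_form(answers, key, upper_ids, lower_ids):
    """Reproduce the hand-tally form for one item: count how many students in
    the upper and lower groups chose each alternative, then fill in the
    DIFFERENCE, D, TOTAL, and DIFFICULTY columns.

    answers -- dict: student id -> chosen option ('A'..'D', or '' for omit)
    key     -- the keyed (correct) option
    """
    upper = Counter(answers[s] for s in upper_ids)
    lower = Counter(answers[s] for s in lower_ids)
    difference = upper[key] - lower[key]          # step 5
    d_index = difference / len(upper_ids)         # step 6: divide by group size
    total_right = upper[key] + lower[key]         # step 7
    difficulty = total_right / (len(upper_ids) + len(lower_ids))  # step 10: p
    return upper, lower, d_index, difficulty
```

Each Counter is one column of the form; the returned scalars are the row's remaining cells.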
THREE POSSIBLE SHORT CUTS HERE (STEP 4)

(A) If you have a large sample of around 100 or more, you can cut down the sample you work with:
   o take the top 27% (27 out of 100) and the bottom 27% (so you're only dealing with 54, not all 100)
   o put the middle 46 aside for the moment
   o the larger the sample, the more accurate, but you have to trade that off against labour; using the top 1/3 or so is probably good enough by the time you get to 100; 27% is the magic figure statisticians tell us to use
   o I'd use halves at 30, but you could just use a sample of the top 10 and bottom 10 if you're pressed for time
   o that means a single student changes the stats by 10%, trading off speed for accuracy... but I'd rather have you doing ten and ten than nothing

(B) Second shortcut, if you have access to a photocopier (budgets):
   o photocopy the answer sheets and cut off the identifying information (you can't use this if handwriting is distinctive)
   o colour code the high and low groups with a dab of marker pen
   o distribute the sheets randomly to students in your class so they don't know whose answer sheet they have
   o get them to raise their hands: for #6, how many have "A" on a blue sheet? how many have "B"? how many "C"? then, for #6, how many have "A" on a red sheet....
   o some reservations, because they can screw you up if they don't take it seriously
   o another version of this would be to hire the kid who cuts your lawn to do the counting, provided you've removed all identifying information; I actually did this for a bunch of teachers at one high school in Edmonton when I was in university, for pocket money

(C) Third shortcut, IF you can't use a separate answer sheet: sometimes it's faster to type than to sort.

SAMPLE OF TYPING FORMAT FOR ITEM ANALYSIS

   ITEM #   1 2 3 4 5 6 7 8 9 10
   KEY      T F T F T A D C A B
   Kay      T T T F T A D C A D
   Jane     T T T F F A D D A C
   John     F F T F T A D C A B

   o type the name, then T or F, or A, B, C, D; it's all left hand on the typewriter, leaving your right hand free to turn pages (from Sax)

IF you have a computer program (there are some kicking around), it will give you all the stats you need, plus bunches more you don't, automatically after this stage.
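The typed format above is also trivial to score by machine. A small sketch (my own routine, assuming one response string per student, matched position by position against the key):

```python
def score_typed_rows(rows, key):
    """Score answer rows typed in the sample format: each row is
    (student name, string of responses), compared against the key string.
    Spaces are ignored so 'TFTFT ADCAB' and 'TFTFTADCAB' are equivalent."""
    key = key.replace(" ", "")
    results = {}
    for name, typed in rows:
        responses = typed.replace(" ", "")
        results[name] = sum(1 for k, r in zip(key, responses) if k == r)
    return results
```

Running it on Kay's row from the sample (two responses differ from the key) yields a score of 8 out of 10.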
OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text)

5. Subtract the number of students in the lower group who got the question right from the number of high group students who got it right
   o it's quite possible to get a negative number

   ITEM        UPPER   LOWER   DIFFERENCE   D      TOTAL   DIFFICULTY
   1.  A         0
      *B         4       2         2
       C         1
       D         1
       O
   *=Keyed Answer

6. Divide the difference by the number of students in the upper or lower group
   o in this case, divide by 15
   o this gives you the "discrimination index" (D)

   ITEM        UPPER   LOWER   DIFFERENCE   D      TOTAL   DIFFICULTY
   1.  A         0
      *B         4       2         2       0.333
       C         1
       D         1
       O
   *=Keyed Answer

7. Total the number who got it right

   ITEM        UPPER   LOWER   DIFFERENCE   D      TOTAL   DIFFICULTY
   1.  A         0
      *B         4       2         2       0.333     6
       C         1
       D         1
       O
   *=Keyed Answer

8. If you have a large class and were only using the 1/3 sample for top and bottom groups, then you NOW have to count the number of the middle group who got each question right (not each alternative this time, just right answers)

9. Sample form, class size = 100
   o if it's a class of 30 using upper and lower halves, there's no extra column here

10. Divide the total by the total number of students
   o difficulty = the proportion who got it right (p)

   ITEM        UPPER   LOWER   DIFFERENCE   D      TOTAL   DIFFICULTY
   1.  A         0
      *B         4       2         2       0.333     6        .42
       C         1
       D         1
       O
   *=Keyed Answer

11. You will NOTE the complete lack of complicated statistics: counting, adding, dividing, and no tricky formulas required
   o we're not going to worry about corrected point biserials etc.
   o one of the advantages of using a fixed number of alternatives

Interpreting Item Analysis

Let's look at what we have and see what we can see. 90% of item analysis is just common sense...

1. Potential miskey
2. Identifying ambiguous items
3. Equal distribution to all alternatives
4. Alternatives are not working
5. Distracter too attractive
6. Question not discriminating
7. Negative discrimination
8. Too easy
9. Omit
10. & 11. Relationship between D index and difficulty (p)

Item Analysis of Computer Printouts

1. What do we see looking at this first one? [Potential Miskey]

   ITEM        UPPER   LOW   DIFFERENCE   D      TOTAL   DIFFICULTY
   1. *A         1       4       -3       -.2       5        .17
       B         1       3
       C        10       5
       D         3       3
       O         0       0      <-- O means omit or no answer

   o on #1, more high group students chose C than A, even though A is supposedly the correct answer
   o more low group students chose A than high group, so we got negative discrimination; only 17% of the class got it right
   o most likely you just wrote the wrong answer key down; this is an easy and very common mistake to make
   o better you find out now before you hand the test back than when the kids complain, OR WORSE, they don't complain and teach themselves your miskey as the "correct" answer
   o so check it out and rescore that question on all the papers before handing them back
   o rescoring makes it 10-5: Difference = 5; D = .34; Total = 15; difficulty = .50 --> nice item
OR: you check and find that you didn't miskey it --> that IS the answer you thought. Two possibilities:
 1. One possibility is that you made a slip of the tongue and taught them the wrong answer
    - anything you say in class can be taken down and used against you on an examination....
 2. More likely it means even "good" students are being tricked by a common misconception
    --> You're not supposed to have trick questions, so you may want to dump it --> give those who got it right their point, but total the rest of the marks out of 24 instead of 25
    --> If scores are high, or you want to make a point, you might let it stand, and then teach to it --> sometimes if they get caught, it will help them to remember better in future, such as:
        - very fine distinctions
        - crucial steps which are often overlooked
    --> REVISE it for next time to weaken the too-attractive distracter -- alternatives are not supposed to draw more than the keyed answer -- almost always an item flaw, rather than a useful distinction

What can we see with #2? [Can identify ambiguous items]

           Upper   Low   Difference    D     Total   Difficulty
   2.  A     6      5
       B     1      2
      *C     7      5        2       .13      12        .40
       D     1      3
       O

 o #2: about equal numbers of top students went for A and the keyed answer C. Suggests they couldn't tell which was correct
   - either the students didn't know this material (in which case you can reteach it)
   - or the item was defective --> look at their favorite wrong alternative again, and see if you can find any reason they could be choosing it
 o often items that look perfectly straightforward to adults are ambiguous to students
 o favorite examples of ambiguous items
 o if you NOW realize that the popular alternative was a defensible answer, rescore before you hand it back to give everyone credit for either answer -- avoids arguing with you in class
 o if it's clearly a wrong answer, then you now know which error most of your students are making to get the wrong answer -- useful diagnostic information on their learning and your teaching

Equal distribution to all alternatives
           Upper   Low   Difference    D     Total   Difficulty
   3.  A     4      3
       B     3      4
      *C     5      4        1       .06       9        .30
       D     3      4
       O

Item #3: students respond about equally to all alternatives
 o usually means they are guessing

Three possibilities:
 1. May be material you didn't actually get to yet
    - you designed the test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before the holidays....
    - or an item on a common exam that you didn't stress in your class
 2. Item so badly written students have no idea what you're asking
 3. Item so difficult students are just completely baffled

Review the item:
 o if badly written (by another teacher) or on material your class hasn't taken, toss it out and rescore the exam out of the lower total
   - BUT give credit to those that got it, to a total of 100%
 o if it seems well written, but too hard, then you know to (re)teach this material for the rest of the class....
   - maybe the 3 who got it are your top three students
   - tough but valid item:
     - OK, if the item tests a valid objective
     - you want to provide an occasional challenging question for top students
     - but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"

Alternatives aren't working

           Upper   Low   Difference    D     Total   Difficulty
   4.  A     1      5
      *B    14      7        7       .47      21        .70
       C     0      2
       D     0      0
       O

Example #4 --> no one fell for D --> so it is not a plausible alternative
 o the question is fine for this administration, but revise the item for next time
 o toss alternative D and replace it with something more realistic
 o each distracter has to attract at least 5% of the students
 o in a class of 30, each distracter should get at least two students
   - or you might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
 o if two alternatives don't draw any students --> might consider redoing the item as true/false

Distracter too attractive

           Upper   Low   Difference    D     Total   Difficulty
   5.  A     7     10
       B     1      2
       C     1      1
      *D     5      2        3       .20       7        .23
       O

Sample #5 --> too many going for A
 o no ONE distracter should get more than the key
 o no one distracter should pull more than about half of the students -- doesn't leave enough for the correct answer and five percent for each alternative
 o keep it for this time; weaken A for next time

Question not discriminating

           Upper   Low   Difference    D     Total   Difficulty
   6. *A     7      7        0       .00      14        .47
       B     3      2
       C     2      1
       D     3      5
       O

Sample #6: the low group gets it as often as the high group
 o on norm-referenced tests, the point is to rank students from best to worst
 o so individual test items should have good students get the question right and poor students get it wrong
 o the test overall decides who is a good or poor student on this particular topic
   - those who do well have more information and skills than those who do less well
   - so if, on a particular question, those with more skills and knowledge do NOT do better, something may be wrong with the question
 o the question may be VALID, but off topic
   - E.G.: the rest of the test tests thinking skill, but this is a memorization question, so skilled and unskilled students are equally likely to recall the answer
 o you should have a homogeneous test --> don't put a math item in with social studies
 o if you wanted to get really fancy, you could do a separate item analysis for each cell of your table of specifications, as long as you had six items per cell
 o a question may be VALID, on topic, but not RELIABLE
   - it addresses the specified objective, but isn't a useful measure of individual differences
   - asking Grade 10s the capital of Canada is on topic, but since they will all get it right, it won't show individual differences -- gives you a low D

Negative discrimination

           Upper   Low   Difference    D     Total   Difficulty
   7. *A     7     10       -3       -.20     17        .57
       B     3      3
       C     2      1
       D     3      1
       O

 o the D (discrimination) index is just the upper group minus the lower group
 o it varies from +1.0 to -1.0
 o if all of the top group got it right and all of the lower group got it wrong = 100% = +1
 o if more of the bottom group get it right than the top group, you get a negative D index
 o if you have a negative D, it means that students with less skill and knowledge overall are getting it right more often than those who the test says are better overall
 o in other words, the better you are, the more likely you are to get it wrong

WHAT COULD ACCOUNT FOR THAT? Two possibilities:
 1. Usually it means an ambiguous question
    - one that is confusing good students, while weak students are too weak to see the problem
    - look at the question again, and at the alternatives the good students are going for, to see if you've missed something
 2. OR it might be off topic --> something weaker students are better at (like rote memorization) than good students --> not part of the same set of skills as the rest of the test --> suggests a design flaw with the table of specifications, perhaps

((If you end up with a whole bunch of negative D indices on the same test, it must mean you actually have two different distinct skills, because by definition the low group is the high group on that bunch of questions --> you end up treating them as two separate tests))
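Screening a whole test for negative D indices, as described above, can be sketched as follows (an illustrative sketch, not from the handout; it assumes a 0/1 score matrix and top/bottom halves ranked by total score):

```python
# `scores` is a list of per-student lists of 0/1 item scores.

def discrimination_indices(scores):
    """D index per item, using top and bottom halves ranked by total score."""
    ranked = sorted(scores, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    n_items = len(scores[0])
    return [
        (sum(s[i] for s in upper) - sum(s[i] for s in lower)) / half
        for i in range(n_items)
    ]

# Four students, three items; item index 2 is answered correctly mainly by
# low scorers, so it shows a negative D and gets flagged for review.
scores = [
    [1, 1, 0],  # strong student
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],  # weak student
]
ds = discrimination_indices(scores)
flagged = [i for i, d in enumerate(ds) if d <= 0]
print(ds, flagged)  # [1.0, 1.0, -1.0] [2]
```

As the text cautions, a flag from one small administration may be random chance; the point is to track such items across administrations.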
 o if you have a large enough sample (like the provincial exams), then we toss the item and either don't count it or give everyone credit for it
 o with a sample of 100 students or less, it could just be random chance, so basically ignore it in terms of THIS administration
   - the kids wrote it; give them the mark they got
   - furthermore, if you keep dropping questions, you may find that you're starting to develop serious holes in your blueprint coverage -- a problem for sampling
 o but you want to track this FOR NEXT TIME
   - if it's negative on administration after administration, consistently, it's likely not random chance; the item is screwing up in some way
 o you want to build your future tests out of those items with high positive D indices
   - the higher the average D indices on the test, the more RELIABLE the test as a whole will be
 o revise items to increase D
   --> if good students are selecting one particular wrong alternative, make it less attractive
   --> or increase the probability of their selecting the right answer by making it more attractive
 o you may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification
   - what this means is that there are some skills/knowledge in this unit which are unrelated to the rest of the skills/knowledge --> but they may still be important
   - e.g., the statistics part of this course may be terrible for those students who are the best item writers, since writing tends to be associated with the opposite hemisphere of the brain from math, right... but it is still an important objective in this course
   - it may lower the reliability of the test, but it increases content validity

Too easy
           Upper   Low   Difference    D     Total   Difficulty
   8.  A     0      1
      *B    14     13        1       .06      27        .90
       C     0      1
       D     1      1
       O

 o too easy or too difficult won't discriminate well either
 o difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0 (nobody did)
 o REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION
 o if the item is NOT miskeyed, and there is no other glaring problem, it's too late to change it after it has been administered --> everybody got it right? OK, give them the mark
 o TOO DIFFICULT = 30 to 35% (used to be the rule in the Branch, now not...)
 o if the item is too difficult, don't drop it just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there; and unless literally EVERYONE missed it, what do you do with the students who got it right? give them bonus marks? cheat them of a mark they earned?
 o furthermore, if you drop too many questions, you lose content validity (specs) --> if two or three got it right it may just be random chance, so why should they get a bonus mark?
 o however, DO NOT REUSE questions with too high or too low difficulty (p) values in future
 o if difficulty is over 85%, you're wasting space on a limited-item test
   - asking Grade 10s the capital of Canada is probably a waste of their time and yours --> unless this is a particularly vital objective
   - the same applies to items which are too difficult --> no use asking Grade 3s to solve a quadratic equation
   - but you may want to revise the question to make it easier or harder rather than just toss it out cold

OR SOME EXCEPTIONS HERE:
 o You may have consciously decided to develop a "Mastery"-style test --> it will often have very easy questions -- you expect everyone to get everything and are trying to identify only those who are not ready to go on --> in which case, don't use any question whose difficulty level is below 85% or whatever
 o Or you may want a test to identify the top people in the class, the Reach for the Top team, and design a whole test of really tough questions --> low difficulty values (i.e., very hard)
 o so it depends a bit on what you intend to do with the test in question
 o this is what makes the difficulty index (proportion) so handy

13.
 o you create a bank of items over the years --> using item analysis you get better questions all the time, until you have a whole bunch that work great --> you can then tailor-make a test for your class
 o if you want to create an easier test this year, you pick questions with higher difficulty (p) values; if you want to make a challenging test for your gifted kids, choose items with low difficulty (p) values
 o for most applications you will want to set the difficulty level so that it gives you average marks, a nice bell curve
   - the government uses 62.5 --> four-item multiple choice, middle of the bell curve
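Pulling items from such a bank by their stored p and D values can be sketched as below (an illustrative sketch only; the bank structure, field names, and cutoffs are my assumptions, not from the text):

```python
# A toy item bank: each entry stores the difficulty (p) and discrimination
# (d) observed in past administrations.
bank = [
    {"id": 1, "p": 0.90, "d": 0.10},
    {"id": 2, "p": 0.62, "d": 0.45},
    {"id": 3, "p": 0.30, "d": 0.50},
    {"id": 4, "p": 0.55, "d": 0.35},
]

def pick(bank, p_min, p_max, d_min=0.20):
    """Keep items whose difficulty falls in [p_min, p_max] with a decent D."""
    return [it["id"] for it in bank
            if p_min <= it["p"] <= p_max and it["d"] >= d_min]

print(pick(bank, 0.50, 0.75))   # middling items for a bell-curve test -> [2, 4]
print(pick(bank, 0.00, 0.40))   # hard items for a challenge test -> [3]
```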
14. Start tests with an easy question or two to give students a running start

15. Make sure that the difficulty levels are spread out over the examination blueprint
    o not all hard geography questions and easy history questions
      - unfair to kids who are better at geography and worse at history
      - turns the class off geography if they equate it with tough questions
    o REMEMBER here that difficulty is different than complexity (Bloom)
      - so you can have a difficult recall-knowledge question and an easy synthesis question
      - synthesis and evaluation items will tend to be harder than recall questions, so if you find the higher levels are more difficult, OK, but try to balance the cells as much as possible
      - certainly the content cells should be roughly the same

OMIT

           Upper   Low   Difference    D     Total   Difficulty
   9.  A     2      1
       B     3      4
      *C     7      3        4       .26      10        .33
       D     1      1
       O     2      4

If near the end of the test:
 o --> they didn't find it because it was on the next page -- a format problem
 o OR --> your test is too long; 6 of them (20%) didn't get to it

OR, if in the middle of the test:
 o --> it totally baffled them, because:
   - it was way too difficult for these guys
   - or, since there are also 2 omits from the high group: ambiguous wording

10. & 11. RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)

           Upper   Low   Difference    D     Total   Difficulty
  10.  A     0      5
      *B    15      0       15       1.0      15        .50
       C     0      5
       D     0      5
       O
           Upper   Low   Difference    D     Total   Difficulty
  11.  A     3      2
      *B     8      7        1       .06      15        .50
       C     2      3
       D     2      3
       O

 o 10 is a perfect item --> each distracter gets at least 5 students, and the discrimination index is +1.0 (ACTUALLY, A PERFECT ITEM WOULD HAVE A DIFFICULTY OF 65% TO ALLOW FOR GUESSING)
 o high discrimination (D) indices require optimal levels of difficulty, but optimal levels of difficulty do not assure high levels of D
 o 11 has the same difficulty level, but a different D
   - on a four-item multiple-choice question, a student working totally by chance will get 25%

Item analysis

An item analysis involves many statistics that can provide useful information for improving the quality and accuracy of multiple-choice or true/false items (questions). Some of these statistics are:

Item difficulty: the percentage of students that correctly answered the item.
 o Also referred to as the p-value. The range is from 0% to 100%, more typically written as a proportion of 0.0 to 1.00. The higher the value, the easier the item.
 o Calculation: Divide the number of students who got an item correct by the total number of students who answered it.
 o Ideal value: Slightly higher than midway between chance (1.00 divided by the number of choices) and a perfect score (1.00) for the item. For example, on a four-alternative multiple-choice item, the random guessing level is 1.00/4 = 0.25; therefore, the optimal difficulty level is .25 + (1.00 - .25)/2 = 0.625. On a true/false question, the guessing level is 1.00/2 = .50 and, therefore, the optimal difficulty level is .50 + (1.00 - .50)/2 = .75.
 o P-values above 0.90 indicate very easy items and should be carefully reviewed based on the instructor's purpose. For example, if the instructor is using easy "warm-up" questions or aiming for student mastery, then some items with p-values above .90 may be warranted. In contrast, if an instructor is mainly interested in differences among students, these items may not be worth testing.
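The optimal-difficulty rule above reduces to a one-line calculation (a minimal sketch; the function name is mine):

```python
# Optimal difficulty: midway between the chance level and a perfect score.

def optimal_difficulty(n_choices):
    """Return the ideal p-value for an item with n_choices options."""
    chance = 1.0 / n_choices
    return chance + (1.0 - chance) / 2

print(optimal_difficulty(4))  # four-choice multiple choice: 0.625
print(optimal_difficulty(2))  # true/false: 0.75
```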
 o P-values below 0.20 indicate very difficult items and should be reviewed for possible confusing language, removed from subsequent exams, and/or identified as an area for re-instruction. If almost all of the students get the item wrong, there is either a problem with the item or students were not able to learn the concept. However, if an instructor is trying to identify the top percentage of students that learned a certain concept, this highly difficult item may be necessary.

Item discrimination: the relationship between how well students did on the item and their total exam score.
 o Also referred to as the point-biserial correlation (PBS).
 o The range is from -1.00 to 1.00. The higher the value, the more discriminating the item. A highly discriminating item indicates that students who had high exam scores got the item correct, whereas students who had low exam scores got the item incorrect.
 o Items with discrimination values near or less than zero should be removed from the exam. This indicates that students who overall did poorly on the exam did better on that item than students who overall did well. The item may be confusing for your better-scoring students in some way.
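Computing this index from raw scores can be sketched as follows, using the standard mean-difference form of the point-biserial correlation (an illustrative sketch; the data and names are made up, not from the text):

```python
import math

def point_biserial(item_scores, total_scores):
    """PBS = (mean_correct - mean_all) / sd_all * sqrt(p / q)."""
    n = len(total_scores)
    mean_all = sum(total_scores) / n
    # population standard deviation of the total exam scores
    sd_all = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    p = len(correct) / n          # item difficulty
    q = 1 - p
    mean_correct = sum(correct) / len(correct)
    return (mean_correct - mean_all) / sd_all * math.sqrt(p / q)

# High scorers get the item right, low scorers miss it -> strongly positive.
item = [1, 1, 1, 0, 0, 0]
totals = [95, 90, 88, 60, 55, 50]
print(round(point_biserial(item, totals), 2))  # 0.98
```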
 o Acceptable range: 0.20 or higher
 o Ideal value: the closer to 1.00 the better
 o Calculation:

       PBS = ((Xc - Xt) / S.D.total) * sqrt(p / q)

   where
       Xc = the mean total score for persons who have responded correctly to the item
       Xt = the mean total score for all persons
       p = the difficulty value for the item
       q = 1 - p
       S.D.total = the standard deviation of total exam scores

Reliability coefficient: a measure of the amount of measurement error associated with an exam score.
 o The range is from 0.0 to 1.0. The higher the value, the more reliable the overall exam score.
 o Typically, internal consistency reliability is measured. This indicates how well the items are correlated with one another. High reliability indicates that the items are all measuring the same thing, or general construct (e.g. knowledge of how to calculate integrals for a calculus course).
 o With multiple-choice items that are scored correct/incorrect, the Kuder-Richardson formula 20 (KR-20) is often used to calculate the internal consistency reliability:

       KR-20 = (K / (K - 1)) * (1 - (sum of p*q over items) / s2x)

   where
       K = number of items
       p = proportion of persons who responded correctly to an item (i.e., the difficulty value)
       q = proportion of persons who responded incorrectly to an item (i.e., 1 - p)
       s2x = total score variance

 o Three ways to improve the reliability of the exam are to 1) increase the number of items in the exam, 2) use items that have high discrimination values, or 3) perform an item-total statistical analysis.
 o Acceptable range: 0.60 or higher
 o Ideal value: 1.00

Item-total statistics: measure the relationship of individual exam items to the overall exam score. Currently, the University of Texas does not perform this analysis for faculty. However, one can calculate these statistics using SPSS or SAS statistical software.

 1. Corrected item-total correlation
    o This is the correlation between an item and the rest of the exam, without that item considered part of the exam.
    o If the correlation is low for an item, this means the item isn't really measuring the same thing the rest of the exam is trying to measure.
 2. Squared multiple correlation
    o This measures how much of the variability in the responses to this item can be predicted from the other items on the exam.
    o If an item does not predict much of the variability, then the item should be considered for deletion.
 3. Alpha if item deleted
    o The change in Cronbach's alpha if the item is deleted.
    o When the alpha value is higher than the current alpha with the item included, one should consider deleting that item to improve the overall reliability of the exam.

EXAMPLE: Item-total statistics table
Item-total statistics
Summary for scale: Mean = 46.1100, S.D. = 8.26444, Valid n = 100
Cronbach alpha = .794313   Standardized alpha = .800491   Average inter-item correlation = .297818

Variable   Mean if    Var. if    S.D. if    Corrected item-    Squared multiple   Alpha if
           deleted    deleted    deleted    total correlation  correlation        deleted
ITEM1      41.61000   51.93790   7.206795   .656298            .507160            .752243
ITEM2      41.37000   53.79310   7.334378   .666111            .533015            .754692
ITEM3      41.41000   54.86190   7.406882   .549226            .363895            .766778
ITEM4      41.63000   56.57310   7.521509   .470852            .305573            .776015
ITEM5      41.52000   64.16961   8.010593   .054609            .057399            .824907
ITEM6      41.56000   62.68640   7.917474   .118561            .045653            .817907
ITEM7      41.46000   54.02840   7.350401   .587637            .443563            .762033
ITEM8      41.33000   53.32110   7.302130   .609204            .446298            .758992
ITEM9      41.44000   55.06640   7.420674   .502529            .328149            .772013
ITEM10     41.66000   53.78440   7.333785   .572875            .410561            .763314

By investigating the item-total correlations, we can see that the correlations of items 5 and 6 with the overall exam are .05 and .12, while all other items correlate at .45 or better. By investigating the squared multiple correlations, we can see that again items 5 and 6 are significantly lower than the rest of the items. Finally, by exploring the alpha if deleted, we can see that the reliability of the scale (alpha) would increase to .82 if either of these two items were deleted. Thus, we would probably delete these two items from this exam.

Deleting item process: To delete these items, we would delete one item at a time, preferably item 5 first because it produces the higher exam reliability coefficient if deleted, and re-run the item-total statistics report before deleting item 6, to ensure we do not lower the overall alpha of the exam. After deleting item 5, if item 6 still appears as an item to delete, then we would repeat this deletion process for the latter item.

Distractor evaluation: another useful item review technique.
 o The distractor should be considered an important part of the item. Nearly 50 years of research shows that there is a relationship between the distractors students choose and total exam score. The quality of the distractors influences student performance on an exam item.
 o Although the correct answer must be truly correct, it is just as important that the distractors be incorrect. Distractors should appeal to low scorers who have not mastered the material, whereas high scorers should infrequently select the distractors.
 o Reviewing the options can reveal potential errors of judgment and inadequate performance of distractors. These poor distractors can be revised, replaced, or removed.
 o One way to study responses to distractors is with a frequency table. This table tells you the number and/or percent of students that selected a given distractor. Distractors that are selected by few or no students should be removed or replaced. These kinds of distractors are likely to be so implausible to students that hardly anyone selects them.
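Building such a frequency table can be sketched as below (illustrative data only; `responses` holds one item's answers from a hypothetical class):

```python
from collections import Counter

# One item's responses; "C" is the keyed answer in this made-up example.
responses = ["A", "C", "C", "B", "C", "C", "C", "A", "C", "C"]
key = "C"

counts = Counter(responses)
n = len(responses)
for option in "ABCD":
    tag = " (key)" if option == key else ""
    print(f"{option}{tag}: {counts[option]} ({100 * counts[option] / n:.0f}%)")

# Any distractor chosen by almost no one (here D, with zero picks) is a
# candidate for replacement; a distractor outdrawing the key suggests a
# miskey or a flawed item.
```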
Definition: the incorrect alternatives in a multiple-choice item.
 o Reported as: the frequency (count), or number of students, that selected each incorrect alternative.
 o Acceptable range: each distractor should be selected by at least a few students.
 o Ideal value: distractors should be equally popular.
 o Interpretation:
   - Distractors that are selected by few or no students should be removed or replaced.
   - One distractor that is selected by as many or more students than the correct answer may indicate a confusing item and/or options.
   - The number of people choosing a distractor can be lower or higher than expected because of:
     - partial knowledge
     - a poorly constructed item
     - a distractor that is outside of the area being tested