Item analysis and validation

  1. ITEM ANALYSIS AND VALIDATION Mark Leonard Tan VerenaGonzales AnnCreiaTupasi Ramil Cabañesas
  2. Introduction The teacher normally prepares a draft of the test. Such a draft is subjected to item analysis and validation to ensure that the final version of the test would be useful and functional.
  3. Phases of preparing a test
     - Try-out phase
     - Item analysis phase
     - Item revision phase
  4. Item Analysis
     There are two important characteristics of an item that will be of interest to the teacher:
     - Item difficulty
     - Discrimination index
  5. Item difficulty, or the difficulty of an item, is defined as the number of students who are able to answer the item correctly divided by the total number of students. Thus: Item difficulty = (number of students with the correct answer) / (total number of students). The item difficulty is usually expressed as a percentage.
  6. Example: What is the item difficulty index of an item if 25 students are unable to answer it correctly while 75 answered it correctly? Here the total number of students is 100, hence, the item difficulty index is 75/100 or 75%.
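The calculation in this example can be sketched in a few lines of Python. This is only an illustration; the function name `item_difficulty` is a hypothetical helper, not something defined in the slides.

```python
def item_difficulty(num_correct: int, num_students: int) -> float:
    """Proportion of students who answered the item correctly."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    if not 0 <= num_correct <= num_students:
        raise ValueError("num_correct must be between 0 and num_students")
    return num_correct / num_students


# Slide example: 75 students answered correctly, 25 did not.
correct, incorrect = 75, 25
p = item_difficulty(correct, correct + incorrect)
print(f"Item difficulty: {p:.2f} ({p:.0%})")
```

The total number of students is taken as the sum of those who answered correctly and those who did not, as in the slide.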
  7. One problem with this type of difficulty index is that it may not actually indicate that the item is difficult or easy. A student who does not know the subject matter will naturally be unable to answer the item correctly even if the question is easy. How do we decide on the basis of this index whether the item is too difficult or too easy?
  8. Interpretation of the difficulty index:

     | Range of difficulty index | Interpretation | Action |
     | --- | --- | --- |
     | 0 – 0.25 | Difficult | Revise or discard |
     | 0.26 – 0.75 | Right difficulty | Retain |
     | 0.76 and above | Easy | Revise or discard |
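As a rough sketch, the rule of thumb in this table can be written as a small lookup function. The name `interpret_difficulty` is hypothetical, and the handling of values that fall between the listed bands (for example 0.255) is an assumption: they are assigned to the next band up.

```python
def interpret_difficulty(p: float) -> tuple[str, str]:
    """Map a difficulty index (proportion correct) to (interpretation, action)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("difficulty index must lie between 0 and 1")
    if p <= 0.25:
        return ("Difficult", "Revise or discard")
    if p <= 0.75:  # assumption: anything above 0.25 up to 0.75 is 'right difficulty'
        return ("Right difficulty", "Retain")
    return ("Easy", "Revise or discard")


for value in (0.10, 0.75, 0.90):
    print(value, interpret_difficulty(value))
```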
  9. Difficult items tend to discriminate between those who know and those who do not know the answer. Easy items cannot discriminate between those two groups of students. We are therefore interested in deriving a measure that will tell us whether an item can discriminate between these two groups of students. Such a measure is called an index of discrimination.
  10. An easy way to derive such a measure is to measure how difficult an item is with respect to those in the upper 25% of the class and how difficult it is with respect to those in the lower 25% of the class. If the upper 25% of the class found the item easy yet the lower 25% found it difficult, then the item can discriminate properly between these two groups. Thus:
  11. Index of discrimination = DU – DL, where DU and DL are the difficulty indices of the item for the upper 25% and the lower 25% of the class. Example: Obtain the index of discrimination of an item if the upper 25% of the class had a difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer) while the lower 25% of the class had a difficulty index of 0.20.
  12. DU = 0.60 while DL = 0.20; thus, index of discrimination = 0.60 - 0.20 = 0.40.
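The same subtraction can be sketched in Python. The function name `discrimination_index` is a hypothetical helper; the result is rounded for display because binary floating-point arithmetic does not represent 0.60 - 0.20 exactly.

```python
def discrimination_index(du: float, dl: float) -> float:
    """Difference between the upper-group and lower-group difficulty indices."""
    for value in (du, dl):
        if not 0.0 <= value <= 1.0:
            raise ValueError("difficulty indices must lie between 0 and 1")
    return du - dl


# Slide example: DU = 0.60 and DL = 0.20.
d = discrimination_index(0.60, 0.20)
print(f"Index of discrimination: {d:.2f}")
```

The extreme cases described in the slides follow directly: `discrimination_index(1.0, 0.0)` gives 1.0 and `discrimination_index(0.0, 1.0)` gives -1.0.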
  13. Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the index of discrimination is equal to -1, this means that all of the lower 25% of the students got the correct answer while all of the upper 25% got the wrong answer. In a sense, such an index discriminates correctly between the two groups, but the item itself is highly questionable.
  14. On the other hand, if the index of discrimination is 1.0, this means that all of the lower 25% failed to get the correct answer while all of the upper 25% got the correct answer. This is a perfectly discriminating item and is the ideal item that should be included in the test. As in the case of the difficulty index, we have the following rule of thumb:
  15. Interpretation of the index of discrimination:

      | Index range | Interpretation | Action |
      | --- | --- | --- |
      | -1.0 to -.50 | Can discriminate but the item is questionable | Discard |
      | -.55 to .45 | Non-discriminating | Revise |
      | .46 to 1.0 | Discriminating item | Include |
  16. Example: Consider a multiple-choice test item for which the following data were obtained:

      | Item 1 | A | B* | C | D |
      | --- | --- | --- | --- | --- |
      | Total | 0 | 40 | 20 | 20 |
      | Upper 25% | 0 | 15 | 5 | 0 |
      | Lower 25% | 0 | 5 | 10 | 5 |

      The correct response is B. Let us compute the difficulty index and index of discrimination.
  17. Difficulty index = (no. of students getting the correct answer) / (total no. of students) = 40/80 = 50%, which is within the range of a “good item”.
  18. The discrimination index can similarly be computed: DU = (no. of students in the upper 25% with correct response) / (no. of students in the upper 25%) = 15/20 = 0.75 or 75%. DL = (no. of students in the lower 25% with correct response) / (no. of students in the lower 25%) = 5/20 = 0.25 or 25%. Discrimination index = DU – DL = 0.75 - 0.25 = 0.50 or 50%. Thus, the item also has “good discriminating power”.
  19. It is also instructive to note that distracter A is not an effective distracter since it was never selected by the students. Distracters C and D appear to have good appeal as distracters.
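The whole worked example, including the check for unused distracters, can be sketched from the option counts. This is an illustration only. The variable names are hypothetical, and the class size is assumed to be the sum of the option counts in the Total row (80), which matches upper and lower groups of 20 students each.

```python
KEY = "B"  # correct response

# Option counts taken from the example table.
total = {"A": 0, "B": 40, "C": 20, "D": 20}
upper = {"A": 0, "B": 15, "C": 5, "D": 0}
lower = {"A": 0, "B": 5, "C": 10, "D": 5}

n_total = sum(total.values())  # assumed class size
n_upper = sum(upper.values())  # size of the upper 25% group
n_lower = sum(lower.values())  # size of the lower 25% group

difficulty = total[KEY] / n_total
du = upper[KEY] / n_upper
dl = lower[KEY] / n_lower
discrimination = du - dl

# A distracter that nobody chose is not doing any work.
unused_distracters = [opt for opt, count in total.items()
                      if opt != KEY and count == 0]

print(f"Class size: {n_total}")
print(f"Difficulty index: {difficulty:.2f}")
print(f"DU = {du:.2f}, DL = {dl:.2f}, discrimination = {discrimination:.2f}")
print(f"Distracters never selected: {unused_distracters}")
```

Keeping the counts per option, rather than only the number correct, is what makes the distracter check possible.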
  20. Basic Item Analysis Statistics. The Michigan State University Measurement and Evaluation Department reports a number of item statistics which aid in evaluating the effectiveness of an item. Index of Difficulty – the proportion of the total group who got the item wrong. Thus a high index indicates a difficult item and a low index indicates an easy item. (Note that this reverses the convention used earlier, where the difficulty index is the proportion who got the item right.)
  21. Index of Discrimination – the difference between the proportion of the upper group who got an item right and the proportion of the lower group who got the item right.
  22. More Sophisticated Discrimination Index
      - Item discrimination refers to the ability of an item to differentiate among students on the basis of how well they know the material being tested.
      - A good item is one that has good discriminating ability and has a sufficient level of difficulty (neither too difficult nor too easy).
  23. At the end of the item analysis report, test items are listed according to their degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test and can be used to identify items which are not performing well and which can perhaps be improved or discarded.
  24. The item-analysis procedure for norm-referenced tests provides the following information: 1. The difficulty of an item 2. The discriminating power of an item 3. The effectiveness of each alternative
  25. Benefits derived from Item Analysis 1. It provides useful information for class discussion of the test. 2. It provides data which helps students improve their learning. 3. It provides insights and skills that lead to the preparation of better tests in the future.
  26. Index of Difficulty
  27. Index of Item Discriminating Power
  28. The discriminating power of an item is reported as a decimal fraction; maximum discriminating power is indicated by an index of 1.00. Maximum discrimination is usually found at the 50 per cent level of difficulty. Difficulty levels:
      - 0.00 – 0.20 = very difficult
      - 0.21 – 0.80 = moderately difficult
      - 0.81 – 1.00 = very easy
  29. Validation
      - After performing the item analysis and revising the items which need revision, the next step is to validate the instrument.
      - The purpose of validation is to determine the characteristics of the whole test itself, namely, the validity and reliability of the test.
      - Validation is the process of collecting and analysing evidence to support the meaningfulness and usefulness of the test.
  30. Validity is the extent to which a test measures what it purports to measure. It also refers to the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results.
  31. There are three main types of evidence that may be collected: 1. Content-related evidence of validity 2. Criterion-related evidence of validity 3. Construct-related evidence of validity
  32. Content-related evidence of validity refers to the content and format of the instrument.
      - How appropriate is the content?
      - How comprehensive?
      - Does it logically get at the intended variable?
      - How adequately does the sample of items or questions represent the content to be assessed?
  33. Criterion-related evidence of validity refers to the relationship between scores obtained using the instrument and scores obtained using one or more other tests (often called the criterion).
      - How strong is this relationship?
      - How well do such scores estimate present or predict future performance of a certain type?
  34. Construct-related evidence of validity refers to the nature of the psychological construct or characteristic being measured by the test.
      - How well does a measure of the construct explain differences in the behaviour of the individuals or their performance on a certain task?
  35. Usual procedure for determining content validity
      - The teacher writes out objectives based on the table of specifications (TOS).
      - The teacher gives the objectives and the TOS to 2 experts along with a description of the test takers.
      - The experts look at the objectives, read over the items in the test and place a check mark in front of each question or item that they feel does NOT measure one or more objectives.
  36. Usual procedure for determining content validity
      - They also place a check mark in front of each objective NOT assessed by any item in the test.
      - The teacher then rewrites any item so checked and resubmits it to the experts and/or writes new items to cover those objectives not heretofore covered by the existing test.
  37. Usual procedure for determining content validity
      - This continues until the experts approve all items and agree that all of the objectives are sufficiently covered by the test.
  38. Obtaining evidence for criterion-related validity
      - The teacher usually compares scores on the test in question with scores on some other independent criterion test which presumably already has high validity (concurrent validity).
      - Another type of validity is predictive validity, wherein the test scores on the instrument are correlated with scores on the later performance of the students.
  39. Gronlund's Expectancy Table (columns show the later grade point average):

      | Test score | Very Good | Good | Needs Improvement |
      | --- | --- | --- | --- |
      | High | 20 | 10 | 5 |
      | Average | 10 | 25 | 5 |
      | Low | 1 | 10 | 14 |
  40. The expectancy table shows that there were 20 students who got high test scores and were subsequently rated very good in terms of their final grades. At the other end, 14 students obtained low test scores and were later graded as needing improvement.
  41. The evidence for this particular test tends to indicate that students getting high scores on it would later be graded very good, students getting average scores would later be rated good, and students getting low scores would later be graded as needing improvement.
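One common way to read an expectancy table is to convert each row to percentages, so that each cell shows the share of students at a given test-score level who ended up in each grade category. The sketch below does this for the counts in the table; the function name `row_percentages` is a hypothetical helper.

```python
# Counts from the expectancy table: test-score level -> later grade category.
expectancy = {
    "High":    {"Very Good": 20, "Good": 10, "Needs Improvement": 5},
    "Average": {"Very Good": 10, "Good": 25, "Needs Improvement": 5},
    "Low":     {"Very Good": 1,  "Good": 10, "Needs Improvement": 14},
}


def row_percentages(row: dict[str, int]) -> dict[str, float]:
    """Convert one row of counts into percentages of that row's total."""
    row_total = sum(row.values())
    return {category: 100 * count / row_total for category, count in row.items()}


for level, row in expectancy.items():
    shares = ", ".join(f"{cat}: {pct:.1f}%"
                       for cat, pct in row_percentages(row).items())
    print(f"{level:8s} {shares}")
```

In each row the largest share falls on the diagonal of the table, which is the pattern the slides describe.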
  42. Reliability refers to the consistency of the scores obtained: how consistent they are for each individual from one administration of an instrument to another and from one set of items to another.
  43. We already have the formulas for computing the reliability of a test; for internal consistency, for instance, we could use the split-half method or the Kuder-Richardson formulae, KR-20 or KR-21.
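For reference, the two Kuder-Richardson formulae can be sketched as follows, for items scored 1 (correct) or 0 (wrong). This is a minimal sketch under stated assumptions: the function names and the response matrix are hypothetical, and the variance of total scores is computed as a population variance (some textbooks use the sample variance instead, which changes the result slightly).

```python
from statistics import mean, pvariance


def kr20(responses: list[list[int]]) -> float:
    """KR-20: (k / (k - 1)) * (1 - sum(p * q) / variance of total scores)."""
    n, k = len(responses), len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    var_total = pvariance([sum(row) for row in responses])
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion correct on item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(responses: list[list[int]]) -> float:
    """KR-21: like KR-20, but assumes all items are equally difficult."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    totals = [sum(row) for row in responses]
    m, var_total = mean(totals), pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var_total))


# Hypothetical data: 6 students (rows) by 5 items (columns).
scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(f"KR-20 = {kr20(scores):.3f}")
print(f"KR-21 = {kr21(scores):.3f}")
```

KR-21 needs only the mean and variance of the total scores, so it is easier to compute by hand, but it is never larger than KR-20 for the same data.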
  44. Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid outcomes. As reliability improves, validity may improve (or may not). However, if an instrument is shown scientifically to be valid, then it is almost certain that it is also reliable.
  45. The following table is a standard followed almost universally in educational tests and measurement:

      | Reliability | Interpretation |
      | --- | --- |
      | .90 and above | Excellent reliability; at the level of the best standardized tests. |
      | .80 - .90 | Very good for a classroom test. |
      | .70 - .80 | Good for a classroom test; in the range of most. There are probably a few items which could be improved. |
      | .60 - .70 | Somewhat low. This test should be supplemented by other measures (e.g., more tests) for grading. |
      | .50 - .60 | Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading. |
      | .50 or below | Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision. |

Editor's Notes

  1. Why should the bright ones get the wrong answer while the poor ones got the right answer?