Reliability and validity




  1. 1. What is test validity and test validation? Tests themselves are not valid or invalid. Instead, we validate the use of a test score.
  2. 2. <ul><li>Tests are pervasive in our world. Tests can take the form of written responses to a series of questions, such as the paper-and-pencil SAT Reasoning Test™, or of judgments by experts about behavior, such as those for gymnastic trials or for a work performance appraisal. The form of test results also varies, from pass/fail, to holistic judgments, to a complex series of numbers meant to convey minute differences in behavior. </li></ul>
  3. 3. <ul><li>Regardless of the form a test takes, its most important aspect is how the results are used and the way those results impact individual persons and society as a whole. Tests used for admission to schools or programs or for educational diagnosis not only affect individuals, but also assign value to the content being tested. A test that is perfectly appropriate and useful in one situation may be inappropriate or insufficient in another. For example, a test that may be sufficient for use in educational diagnosis may be completely insufficient for use in determining graduation from high school. </li></ul>
  4. 4. <ul><li>Test validity, or the validation of a test, explicitly means validating the use of a test in a specific context, such as college admission or placement into a course. Therefore, when determining the validity of a test, it is important to study the test results in the setting in which they are used. In the previous example, in order to use the same test for educational diagnosis as for high school graduation, each use would need to be validated separately, even though the same test is used for both purposes. </li></ul>
  5. 5. Validity is a matter of degree, not all or none. <ul><li>Samuel Messick, a renowned psychometrician, defines validity as an "integrated evaluative judgment of the degree to which empirical evidence and theoretical rationale support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment." Messick points out that validity is a matter of degree: a test use is not absolutely valid or absolutely invalid. He argues that validity evidence continues to accumulate over time, either strengthening or contradicting previous findings. </li></ul>
  6. 6. Tests sample behavior; they don't measure it directly. <ul><li>Most, but not all, tests are designed to measure skills, abilities, or traits that are not directly observable. For example, scores on the SAT Reasoning Test measure developed critical reading, writing and mathematical ability. The score that an examinee obtains on the SAT Reasoning Test is not a direct measure of critical reading ability in the way that degrees centigrade are a direct measure of the temperature of an object. The amount of an examinee's developed critical reading ability must be inferred from the examinee's SAT Reasoning Test critical reading score. </li></ul>
  7. 7. <ul><li>The process of using a test score as a sample of behavior in order to draw conclusions about a larger domain of behaviors is characteristic of most educational and psychological tests. Responsible test developers and publishers must be able to demonstrate that it is possible to use the sample of behaviors measured by a test to make valid inferences about an examinee's ability to perform tasks that represent the larger domain of interest. </li></ul>
  8. 8. Reliability is not enough; a test must also be valid for its use. <ul><li>If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently. However, tests can be highly reliable and still not be valid for a particular purpose. Crocker and Algina (1986, page 217) demonstrate the difference between reliability and validity with the following analogy. </li></ul>
  9. 9. <ul><li>Consider the analogy of a car's fuel gauge which systematically registers one-quarter higher than the actual level of fuel in the gas tank. If repeated readings are taken under the same conditions, the gauge will yield consistent (reliable) measurements, but the inference about the amount of fuel in the tank is faulty. </li></ul>
  10. 10. This analogy makes it clear that determining the reliability of a test is an important first step, but not the defining step, in determining the validity of a test.
  11. 11. Reliability is the degree to which an assessment tool produces stable and consistent results.
  12. 12. Test reliability <ul><li>Researchers use four methods to check the reliability of a test: the test-retest method, alternate forms, internal consistency, and inter-scorer reliability. Not all of these methods are used for all tests. Each method provides research evidence that the responses are consistent under certain circumstances. </li></ul>
  13. 13. Test-retest reliability <ul><li>Is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. </li></ul><ul><li>Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores. </li></ul>
  14. 14. <ul><li>Example: The test-retest reliability of the WISC-IV was evaluated using data from 243 children. The WISC-IV was administered on two separate occasions, with a mean test-retest interval of 32 days. The average corrected Full Scale IQ stability coefficient was .93. </li></ul><ul><li>Show some computation here using Excel </li></ul>
  15. 15. Correlation <ul><li>The term correlation simply refers to the degree to which two or more sets of data show a tendency to vary together. The strength of the positive (increase, increase) correlation coefficient can vary from .00 to 1.00. </li></ul>
  16. 16. Determining the Correlation Coefficient <ul><li>Step 1: List the scores. Find the mean (M) and standard deviation (SD) for each set of scores. </li></ul><ul><li>Participant # | 1st Scores | 2nd Scores </li><li>1 | 4 | 6 </li><li>2 | 6 | 7 </li><li>3 | 2 | 3 </li><li>4 | 2 | 4 </li><li>5 | 4 | 2 </li><li>6 | 4 | 7 </li><li>7 | 9 | 9 </li><li>8 | 3 | 5 </li><li>9 | 3 | 6 </li><li>10 | 3 | 5 </li></ul><ul><li>M1 = 4; M2 = 5.4 </li></ul><ul><li>SD1 = 2; SD2 = 1.96 </li></ul><ul><li>n1 = 10; n2 = 10 </li></ul>
  17. 17. <ul><li>Step 2: Find the z score for each test score using the formula: </li></ul><ul><li>z = (X - M) / SD </li></ul><ul><li>Z scores are a type of standard score. The z score is useful when attempting to compare items from distributions with different means and standard deviations. The z score for a test score indicates how far and in what direction that test score is from its distribution's mean, expressed in units of its distribution's standard deviation. The z scores will have a mean of zero and a standard deviation of one. </li></ul>
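  18. 18. <ul><li>Participant # | 1st Test Z Score | 2nd Test Z Score </li><li>1 | 0 | .31 </li><li>2 | 1 | .82 </li><li>3 | -1 | -1.22 </li><li>4 | -1 | -.71 </li><li>5 | 0 | -1.73 </li><li>6 | 0 | .82 </li><li>7 | 2.5 | 1.84 </li><li>8 | -0.5 | -.20 </li><li>9 | -0.5 | .31 </li><li>10 | -0.5 | -.20 </li></ul>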
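The z scores in Step 2 can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `z_scores` is an illustrative choice, and it assumes the slides use the population standard deviation (dividing by n), which is what reproduces SD = 2 and SD = 1.96 for the two score sets.

```python
from statistics import mean, pstdev

# Scores from the Step 1 table on the slides.
first_scores = [4, 6, 2, 2, 4, 4, 9, 3, 3, 3]
second_scores = [6, 7, 3, 4, 2, 7, 9, 5, 6, 5]


def z_scores(scores):
    """Standardize scores as (X - M) / SD, using the population SD."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD (divide by n), an assumption
    return [(x - m) / sd for x in scores]


z_first = z_scores(first_scores)
z_second = z_scores(second_scores)

# Print the z scores rounded to two decimals for comparison with the table.
print([round(z, 2) for z in z_first])
print([round(z, 2) for z in z_second])
```

Because the code uses the unrounded SD rather than 1.96, a value may differ from the table in the last decimal place.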
  19. 19. <ul><li>Step 3: Multiply each participant's two z scores, sum the products, and divide by N (the number of participants). This is the defining formula for the Pearson r: </li></ul><ul><li>r = (Σ ZxZy) / N </li></ul>
  20. 20. <ul><li>Participant # | ZxZy </li><li>1 | 0 </li><li>2 | .82 </li><li>3 | 1.22 </li><li>4 | .71 </li><li>5 | 0 </li><li>6 | 0 </li><li>7 | 4.60 </li><li>8 | .10 </li><li>9 | -.16 </li><li>10 | .10 </li><li>∑ZxZy = 7.39 </li></ul><ul><li>(∑ZxZy) / 10 = 0.74* </li></ul><ul><li>*This answer is the correlation coefficient. </li></ul>
  21. 21. About Pearson r <ul><li>The Pearson r is the most commonly used measure of correlation, sometimes called the Pearson Product Moment correlation. It is simply the average of the Z score products (their sum divided by N), and it measures the strength of the linear relationship between two characteristics. A positive (increase, increase) correlation coefficient ranges from 0.00 to 1.00, and a negative (increase, decrease) coefficient ranges from 0.00 to -1.00. The closer the value is to 1.00 or -1.00, the stronger the relationship. </li></ul>
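The three steps above can be sketched in Python as a single function. This is an illustrative sketch rather than the presenter's own computation: `pearson_r` is a hypothetical helper name, and it assumes population standard deviations, as in the worked example.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson r computed as the mean of the paired z-score products."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)  # population SDs, as in the slides
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / len(products)


# Scores from the Step 1 table on the slides.
first_scores = [4, 6, 2, 2, 4, 4, 9, 3, 3, 3]
second_scores = [6, 7, 3, 4, 2, 7, 9, 5, 6, 5]

r = pearson_r(first_scores, second_scores)
print(round(r, 2))  # 0.74
```

The result agrees with the 0.74 obtained by hand from the rounded z scores.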
  22. 22. Alternate Forms <ul><li>This type of reliability uses a second form of a test consisting of similar items, but not the same items. Researchers administer this second "parallel" form of the test after having already administered the first form. This allows researchers to determine a reliability coefficient that reflects error due to different times and different items, and it allows them to control for test form. By administering form A to one group and form B to another group, and then form B to the first group and form A to the second group for the next administration of the test, researchers are able to find a coefficient of stability and equivalence. This is the correlation between scores on the two forms, and it takes into account error due to different times and forms. </li></ul>
  23. 23. <ul><li>Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms. </li></ul>
  24. 24. Inter-rater reliability <ul><li>Is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct* or skill being assessed. </li></ul><ul><li>Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, this type of reliability is more likely to be used when evaluating artwork than when evaluating math problems. </li></ul><ul><li>* A construct is defined as a property that is offered to explain some aspect of human behavior, such as mechanical ability, intelligence or introversion. </li></ul>
  25. 25. Internal consistency reliability <ul><li>Is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. </li></ul>
  26. 26. <ul><li>Split-half reliability is a subtype of internal consistency reliability. The process begins by splitting in half all items of a test that are intended to probe the same area of knowledge (e.g. World War II) in order to form two "sets" of items. The entire test is administered to a group of individuals, the total score for each "set" is computed, and finally the split-half reliability is obtained by determining the correlation between the two sets of total scores. </li></ul>
  27. 27. <ul><li>A test is given once and divided into halves that are scored separately. The scores on one half of the test are then compared with the scores on the other half to estimate reliability. </li></ul>
  28. 28. Why use Split-Half? <ul><li>Split-half reliability is a useful measure when it is impractical or undesirable to assess reliability with two tests or to have two test administrations (because of limited time or money) (Cohen & Swerdlik, 2001). </li></ul>
  29. 29. How do I use Split-Half? <ul><li>1st - Divide the test into halves. The most common way to do this is to assign odd-numbered items to one half of the test and even-numbered items to the other; this is called odd-even reliability. </li></ul><ul><li>2nd - Find the correlation of scores between the two halves by using the Pearson r formula. </li></ul><ul><li>3rd - Adjust the correlation using the Spearman-Brown formula, which raises the reliability estimate to reflect the full-length test. The longer the test, the more reliable it is, so it is necessary to apply the Spearman-Brown formula to a test that has been shortened, as we do in split-half reliability (Kaplan & Saccuzzo, 2001). </li></ul>
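The first two steps of the odd-even procedure can be sketched as follows. This is a minimal illustration, not taken from the slides: the item responses are hypothetical (1 = correct, 0 = incorrect, one row per examinee), and the helper names `pearson_r` and `odd_even_half_correlation` are assumptions made for the example.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson r as the mean of the paired z-score products."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / len(x)


def odd_even_half_correlation(item_scores):
    """Correlate total scores on odd-numbered and even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Hypothetical responses: six examinees, six items each.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 1],
]

r_halves = odd_even_half_correlation(responses)
# Step 3: Spearman-Brown correction for the full-length test.
r_full = 2 * r_halves / (1 + r_halves)
print(round(r_halves, 3), round(r_full, 3))
```

For a positive half-test correlation below 1, the corrected estimate is always larger than the uncorrected one, which is why the third step matters.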
  30. 30. <ul><li>Spearman-Brown formula </li></ul><ul><li>$r_{SB} = \dfrac{2r}{1 + r}$ </li></ul><ul><li>where r = estimated correlation between the two halves (Pearson r) and $r_{SB}$ = estimated reliability of the whole test (Kaplan & Saccuzzo, 2001). </li></ul>
  31. 31. <ul><li>This method does not require two administrations of the same test or of an alternate form. In the split-half method, it is not enough simply to compute the correlation between the halves, because that correlation only estimates the reliability of each half of the test. A statistical correction is needed to estimate the reliability of the whole test. This correction is known as the Spearman-Brown prophecy formula. </li></ul>
  32. 32. <ul><li>If the correlation between the halves is .75, the reliability for the total test is </li></ul><ul><li>$r_{SB} = \dfrac{2r}{1 + r}$ </li></ul><ul><li>$r_{SB} = \dfrac{2(.75)}{1 + .75} = \dfrac{1.5}{1.75} = .857$ </li></ul>
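The worked example above can be checked with a short function. This is a sketch under one stated assumption: the `length_factor` parameter is an addition not found in the slides. It is the general form of the Spearman-Brown prophecy formula, and it reduces to the split-half version when the test length is doubled (factor 2).

```python
def spearman_brown(r_half, length_factor=2.0):
    """Predicted reliability when a test is lengthened by length_factor.

    With length_factor=2 this is the split-half correction 2r / (1 + r).
    """
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)


print(round(spearman_brown(0.75), 3))  # 0.857
```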
  33. 33. <ul><li>Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of these coefficients. </li></ul>
  34. 34. <ul><li>Inter-scorer reliability measures the degree of agreement between persons scoring a subjective test (like an essay exam) or rating an individual. With regard to the latter, this type of reliability is most often used when scorers have to observe and rate the actions of participants in a study. It reveals how well the scorers agreed when rating the same set of things. Other names for this type of reliability are inter-rater reliability and inter-observer reliability. </li></ul>
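The average inter-item correlation described above can be sketched as follows. The response data are hypothetical (five respondents rating four items on a 1 to 5 scale), and the helper names are illustrative assumptions rather than anything defined in the slides.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson r as the mean of the paired z-score products."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / len(x)


def average_inter_item_correlation(responses):
    """Mean Pearson r over every pair of items (columns)."""
    items = list(zip(*responses))  # one tuple of scores per item
    pair_rs = [pearson_r(a, b) for a, b in combinations(items, 2)]
    return mean(pair_rs)


# Hypothetical data: rows are respondents, columns are items.
responses = [
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

avg_r = average_inter_item_correlation(responses)
print(round(avg_r, 3))
```

With four items there are six item pairs, so the result is the mean of six correlation coefficients.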
  35. 35. <ul><li>The kappa coefficients indicate the extent of agreement between the raters, after removing that part of their agreement that is attributable to chance. Because chance agreement is removed, the values of the kappa statistic are lower than the simple percentages of agreement. </li></ul>
  36. 36. <ul><li>Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892); see Smeeton (1985). </li></ul><ul><li>The equation for κ is: $\kappa = \dfrac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$ </li></ul><ul><li>where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters (other than what would be expected by chance) then κ ≤ 0. </li></ul>
  37. 37. <ul><li>Suppose that you were analyzing data related to people applying for a grant. Each grant proposal was read by two people and each reader either said "Yes" or "No" to the proposal. Suppose the data were as follows, where rows are reader A and columns are reader B: </li></ul><ul><li>Reader A \ Reader B | Yes | No </li><li>Yes | 20 | 5 </li><li>No | 10 | 15 </li></ul>
  38. 38. <ul><li>Note that there were 20 proposals that were granted by both reader A and reader B, and 15 proposals that were rejected by both readers. Thus, the observed proportion of agreement is Pr(a) = (20 + 15)/50 = 0.70. </li></ul><ul><li>To calculate Pr(e) (the probability of random agreement) we note that: </li></ul><ul><li>Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time. </li></ul><ul><li>Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time. </li></ul>
  39. 39. <ul><li>Therefore the probability that both of them would say "Yes" randomly is 0.50 × 0.60 = 0.30, and the probability that both of them would say "No" is 0.50 × 0.40 = 0.20. Thus the overall probability of random agreement is Pr(e) = 0.30 + 0.20 = 0.50. </li></ul><ul><li>So now applying our formula for Cohen's kappa we get: $\kappa = \dfrac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} = \dfrac{0.70 - 0.50}{1 - 0.50} = 0.40$ </li></ul>
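The kappa calculation for the grant-proposal example can be sketched in Python. This is an illustrative sketch: `cohens_kappa` is a hypothetical helper that takes a square agreement table with rater A in the rows and rater B in the columns, following the layout of the example.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater A, columns: rater B)."""
    total = sum(sum(row) for row in table)
    # Pr(a): proportion of items on the diagonal, where both raters agree.
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Pr(e): chance agreement from each rater's marginal proportions.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Grant-proposal table from the example: [[Yes/Yes, Yes/No], [No/Yes, No/No]].
grant_table = [[20, 5], [10, 15]]
kappa = cohens_kappa(grant_table)
print(round(kappa, 2))  # 0.4
```

Here Pr(a) = 0.70 and Pr(e) = 0.50, so kappa is 0.40, noticeably lower than the 70% raw agreement.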
  40. 40. Validity refers to how well a test measures what it is purported to measure.
  41. 41. Why is it necessary? <ul><li>While reliability is necessary, it alone is not sufficient. For a test to be valid, it must first be reliable, but a reliable test is not necessarily valid. </li></ul>
  42. 42. Types of Validity
  43. 43. Face Validity <ul><li>Ascertains that the measure appears to be assessing the intended construct under study. The stakeholders can easily assess face validity. Although this is not a very "scientific" type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged from the task. </li></ul>
  44. 44. <ul><li>Ex: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions are about historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation. </li></ul>
  45. 45. Construct Validity <ul><li>Is used to ensure that the measure actually measures what it is intended to measure and not other variables. Using a panel of "experts" familiar with the construct is one way in which this type of validity can be assessed. The experts can examine the items and decide what each specific item is intended to measure. Students can be involved in this process to obtain their feedback. </li></ul>
  46. 46. <ul><li>Ex: A women's studies program may design a cumulative assessment of learning throughout the major. If the questions are written with complicated wording and phrasing, the test can inadvertently become a test of reading comprehension rather than a test of women's studies. It is important that the measure actually assesses the intended construct rather than an extraneous factor. </li></ul>
  47. 47. Criterion-Related Validity <ul><li>Is used to predict future or current performance – it correlates test results with another criterion of interest. </li></ul>
  48. 48. <ul><li>Ex: If a physics program designed a measure to assess cumulative student learning throughout the major, the new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool. </li></ul>
  49. 49. Formative Validity <ul><li>When applied to outcomes assessment it is used to assess how well a measure is able to provide information to help improve the program under study. </li></ul>
  50. 50. <ul><li>Ex: When designing a rubric for history, one could assess students' knowledge across the discipline. If the measure can show that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements. </li></ul>
  51. 51. Sampling Validity <ul><li>Ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of “experts” to ensure that the content area is adequately sampled. Additionally, a panel can help limit “expert bias” (i.e. a test reflecting what an individual personally feels are the most important or relevant areas.) </li></ul>
  52. 52. <ul><li>Ex: When designing an assessment of learning in the theatre department, it would not be sufficient to cover only issues related to acting. Other areas of theatre, such as lighting, sound, and the functions of stage managers, should all be included. The assessment should reflect the content area in its entirety. </li></ul>
  53. 53. Some ways to improve validity <ul><li>Make sure your goals and objectives are clearly defined and operationalized. Expectations of students should be written down. </li></ul><ul><li>Match your assessment measure to your goals and objectives. Additionally, have the test reviewed by faculty at other schools to obtain feedback from an outside party who is less invested in the instrument. </li></ul><ul><li>Get students involved; have the students look over the assessment for troublesome wording, or other difficulties. </li></ul><ul><li>If possible, compare your measure with other measures, or data that may be available. </li></ul>
  54. 54. How do we use simple test analysis in our school?