CHARACTERISTICS OF A GOOD TEST
A. RELIABILITY

Reliability
• Reliability is synonymous with consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications.
• No psychological test is completely consistent; however, a measurement that is unreliable is worthless.

Would you keep using these measurement tools?
The consistency of test scores is critically important in determining whether a test can provide good measurement.

When someone says you are a ‘reliable’ person, what do they really mean?
Are you a reliable person?
Reliability (cont.)
• Because no unit of measurement is exact, any time you measure something (the observed score), you are really measuring two things:
1. True Score - the part of the observed score that truly represents what you intend to measure.
2. Error Component - the part of the observed score that is due to other variables.
Observed Test Score = True Score + Errors of Measurement

Measurement Error
• Any fluctuation in test scores that results from factors related to the measurement process that are irrelevant to what is being measured.
• The difference between the observed score and the true score is called the error score: S_true = S_observed - S_error

Measurement Error is Reduced By:
- Writing items clearly
- Making instructions easily understood
- Adhering to proper test administration
- Providing consistent scoring

Determining Reliability
• There are several ways to determine reliability, depending on the type of measurement and the supporting data required. They include:
- Internal Consistency
- Test-retest Reliability
- Inter-rater Reliability
- Split-half Methods
- Odd-even Reliability
- Alternate Forms Methods
Internal Consistency
• Estimates the reliability of a test solely from the number of items on the test and the intercorrelation among the items. In effect, it compares each item to every other item.
Cronbach’s Alpha:
.80 to .95 (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
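As a rough illustration of how internal consistency depends on the number of items and their intercorrelation, Cronbach's alpha can be computed from a table of item scores. This is a minimal sketch, not taken from the slides: the function name `cronbach_alpha` and the sample scores are hypothetical, and it assumes the usual formula alpha = k/(k-1) × (1 - sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per test taker, one score per item)."""
    k = len(scores[0])  # number of items
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two test takers")
    # Sample variance of each item across test takers
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each test taker's total score
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 test takers, 4 items scored 1-5
sample = [
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(sample), 2))
```

The resulting coefficient can then be read against the scale on the slide.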
Split Half & Odd-Even Reliability
Split Half - refers to determining a correlation between the first half of the measurement and the second half of the measurement (i.e., we would expect answers to the first half to be similar to the second half).
Odd-Even - refers to the correlation between even items and odd items of a measurement tool.
• In this sense, we are using a single test to create two tests, eliminating the need for additional items and multiple administrations.
• Since in both of these types only 1 administration is needed and the groups are determined by the internal components of the test, it is referred to as an internal consistency measure.

Split-half reliability
[error due to differences in item content between the halves of the test]
• Typically, responses on odd versus even items are employed
• Correlate total scores on odd items with the scores obtained on even items
Person | Odd | Even
1      | 36  | 43
2      | 44  | 40
3      | 42  | 37
4      | 33  | 40
110050pairs
Test-retest Reliability
• Test-retest reliability is usually measured by computing the correlation coefficient between the scores of two administrations.

Test-retest Reliability (cont.)
• The amount of time allowed between measures is critical.
• The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time.
• Optimum time between administrations is 2 to 4 weeks.
• The rationale behind this method is that the difference between the scores of the test and the retest should be due solely to measurement error.
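The correlation coefficient between two administrations can be computed directly from the two lists of scores. The sketch below assumes the Pearson product-moment coefficient is the one intended (the slides say only "correlation coefficient"); the function name `pearson_r` and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same 5 test takers, tested twice a few weeks apart
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 14, 10, 19, 18]
print(round(pearson_r(first_administration, second_administration), 2))
```

A coefficient close to 1 means test takers kept roughly the same relative standing on both occasions.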
Inter-rater Reliability
• Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

Inter-rater Reliability (cont.)
• For some scales it is important to assess inter-rater reliability.
• Inter-rater reliability means that if two different raters scored the scale using the scoring rules, they should attain the same result.
• Inter-rater reliability is usually measured by computing the correlation coefficient between the scores of two raters for the set of respondents.
• Here the criterion of acceptability is pretty high (e.g., a correlation of at least .9), but what is considered acceptable will vary from situation to situation.
Parallel/Alternate Forms Method
Parallel/Alternate Forms Method - refers to the administration of two alternate forms of the same measurement device and then comparing the scores.
• Both forms are administered to the same person and the scores are correlated. If the two produce the same results, then the instrument is considered reliable.

Parallel/Alternate Forms Method (cont.)
• A correlation between these two forms is computed just as in the test-retest method.
Advantages
• Eliminates the problem of memory effect.
• Reactivity effects (i.e., experience of taking the test) are also partially controlled.
Factors Affecting Reliability
• Administrator Factors
• Number of Items on the Instrument
• The Instrument Taker
• Heterogeneity of the Items
• Heterogeneity of the Group Members
• Length of Time between Test and Retest

How High Should Reliability Be?
• A highly reliable test is always preferable to a test with lower reliability.
.80 or greater (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
• A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.
Reliability deals with consistency.
Reliability is the quality that guarantees us that we will get similar results when conducting the same test on the same population every time.
Consider this ruler…

VALIDITY
Validity deals with the accuracy of the measurement.
Validity Depends on the PURPOSE
• E.g. a ruler may be a valid measuring device for length, but isn’t very valid for measuring volume
• Measuring what ‘it’ is supposed to
• Matter of degree (how valid?)
• Specific to a particular purpose!
• Learning outcomes
1. Content coverage (relevance?)
2. Level & type of student engagement (cognitive, affective, psychomotor): appropriate?

Types of validity measures
• Face validity
• Construct validity
• Content validity
• Criterion validity
Face Validity
Does it appear to measure what it is supposed to measure?
Example: Let’s say you are interested in measuring ‘Propensity towards violence and aggression’. By simply looking at the following items, state which ones qualify to measure the variable of interest:
• Have you been arrested?
• Have you been involved in physical fighting?
• Do you get angry easily?
• Do you sleep with your socks on?
• Is it hard to control your anger?
• Do you enjoy playing sports?
Construct Validity
• Does the test measure the ‘human’ theoretical construct or trait?
• Examples
- Mathematical reasoning
- Verbal reasoning or fluency
- Musical ability
- Spatial ability
- Motivation
• Applicable to authentic assessment
• Each construct is broken down into its component parts
• E.g. ‘motivation’ can be broken down into:
- Interest
- Attention span
- Hours spent
- Assignments undertaken and submitted, etc.
All of these sub-constructs, put together, measure ‘motivation’.
Content Validity
• How well do the elements of the test relate to the content domain?
• How closely does the content of the questions in the test relate to the content of the curriculum?
• Directly relates to instructional objectives and the fulfillment of the same!
• Major concern for achievement tests (where content is emphasized)
• Can you test students on things they have not been taught?
How to establish Content Validity?
• Instructional objectives (looking at your list)
• Table of Specification
• E.g. At the end of the chapter, the student will be able to do the following:
1. Explain what ‘stars’ are
2. Discuss the types of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our star, the Sun, and all other stars

Table of Specification (An Example)
Categories of Performance (Mental Skills): Knowledge, Comprehension, Analysis
Content areas          | Knowledge | Comprehension | Analysis | Total
1. What are ‘stars’?   |           |               |          |
2. Our star, the Sun   |           |               |          |
3. Constellations      |           |               |          |
4. Galaxies            |           |               |          |
Total                  |           |               |          | Grand Total
Criterion Validity
• The degree to which content on a test (predictor) correlates with performance on relevant criterion measures (concrete criterion in the "real" world?)
• If they do correlate highly, it means that the test (predictor) is a valid one!
• E.g. if you taught skills relating to ‘public speaking’ and had students do a test on it, the test can be validated by looking at how it relates to the actual performance (public speaking) of students inside or outside of the classroom
Factors that can lower Validity
• Unclear directions
• Difficult reading vocabulary and sentence structure
• Ambiguity in statements
• Inadequate time limits
• Inappropriate level of difficulty
• Poorly constructed test items
• Test items inappropriate for the outcomes being measured
• Tests that are too short
• Improper arrangement of items (complex to easy?)
• Identifiable patterns of answers
• Teaching
• Administration and scoring
• Students
• Nature of criterion
Validity and Reliability
[Figure: four panels labeled "Neither Valid nor Reliable", "Reliable but not Valid", "Valid & Reliable", and "Fairly Valid but not very Reliable"]
Think in terms of ‘the purpose of tests’ and the ‘consistency’ with which the purpose is fulfilled/met.
Objectivity
• The state of being fair, without bias or external influence.
• If the test is marked by different people, the score will be the same. In other words, the marking process should not be affected by the marker’s personality.
• Not influenced by emotion or personal prejudice. Based on observable phenomena; presented factually: an objective appraisal.
• The questions and answers should be clear.

An objective test:
• measures an individual’s characteristics in a way that is independent of the rater’s bias or the examiner’s own beliefs
• gauges the test taker’s conscious thoughts and feelings without regard to the test administrator’s beliefs or biases
• helps greatly in determining the test taker’s personality
Understanding Norms
• A list of scores and corresponding percentile ranks, standard scores, or other transformed scores of a group of examinees on whom a test was standardized.
• “In a psychometric context, norms are the test performance data of a particular group of test takers that are designed for use as a reference for evaluating or interpreting individual test scores” (Cohen & Swerdlik, 2002, p. 100).
TYPES OF NORMS
• Percentiles
- refer to a distribution divided into 100 equal parts.
- refer to the score at or below which a specific percentage of scores fall.
Ex. A student obtained a percentile rank of 90 on the NAT exam. What does this mean?

It means that 90% of his classmates scored at or below his score, and 10% of his classmates scored above it.
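The "at or below" definition of a percentile can be turned into a short calculation. This is a minimal sketch under that definition only; the function name `percentile_rank` and the scores are hypothetical, and published norm tables may use a slightly different convention (for example, counting half of the tied scores).

```python
def percentile_rank(score, norm_scores):
    """Percentage of scores in the norm group that fall at or below the given score."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100 * at_or_below / len(norm_scores)


# Hypothetical norm group of 10 examinees
norm_group = [35, 42, 47, 51, 55, 58, 62, 66, 71, 80]
print(percentile_rank(71, norm_group))  # 9 of the 10 scores are at or below 71 -> 90.0
```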
Age Norms (age-equivalent scores)
- “indicate the average performance of different samples of test takers who were at various ages at the time the test was administered” (Cohen & Swerdlik, 2002, p. 105).
Grade Norms
- Used to indicate the average test performance of test takers in a specific grade.
- Based on a ten-month scale; refers to grade and month (e.g., 7.3 is equivalent to seventh grade, third month).

• National Norms
- Derived from a standardization sample nationally representative of the population of interest.
• Subgroup Norms
- Are created when narrowly defined groups are sampled.
Ex.
• Socioeconomic status
• Handedness
• Education level

• Local Norms
- Are derived from the local population’s performance on a measure.
- Typically created locally (i.e., by a guidance counselor, personnel director, etc.)
• Fixed Reference Group Scoring Systems
- Calculation of test scores is based on a fixed reference group that was tested in the past.

• Norm-referenced tests consider the individual’s score relative to the scores of test takers in the normative sample.
• Criterion-referenced tests consider the individual’s score relative to a specified standard or criterion (cut score).
- Licensure exams
- Proficiency tests
Item Analysis
• A name given to a variety of statistical techniques designed to analyze individual items on a test.
• It involves examining class-wide performance on individual test items.
• It sometimes suggests why an item has not functioned effectively and how it might be improved.
• A test composed of items revised and selected on the basis of item analysis is almost certain to be more reliable than one composed of an equal number of untested items.

Difficulty index
• The proportion of students in the class who got an item correct. The larger the proportion, the more students have learned the content measured by the item.

Discrimination index
• A basic measure of the validity of an item.
• A measure of an item’s ability to discriminate between those who scored high on the total test and those who scored low.
• It can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skill is related to the response on an item.
Analysis of response options/distracter analysis
• In addition to examining the performance of a test item, teachers are often interested in examining the performance of individual distracters (incorrect answer options) on multiple-choice items.
• By calculating the proportion of students who chose each answer option, teachers can identify which distracters are working and appear to be attractive to students who do not know the correct answer, and which distracters are simply taking up space and not being chosen by many students.

• To eliminate blind guessing which results in a correct answer purely by chance (which hurts the validity of a test item), teachers want as many plausible distracters as is feasible.
The process of item analysis
1. Arrange the test scores from highest to lowest.
2. Select the criterion groups.
- Identify a High group and a Low group. The High group is the highest-scoring 27% of the examinees, and the Low group is the lowest-scoring 27%.
- Each 27% of the examinees is called a criterion group. This percentage provides the best compromise between two desirable but inconsistent aims: to make the extreme groups as large as possible and as different as possible.
- We can then say with confidence that those in the High group are superior to those in the Low group in the ability measured by the test.

3. For each item, count the number of examinees in the High group who have correct responses. Do a separate, similar count for the Low group.
4. Solve for the difficulty index of each item.
• The larger the value of the index, the easier the item. The smaller the value, the more difficult the item.
• Scale for interpreting the difficulty index of an item:
Below 0.25: item is very difficult
0.25 – 0.75: item is of average difficulty (rightly difficult)
Above 0.75: item is very easy
Example: Item analysis
1. Count and arrange the scores from highest to lowest. Ex. n = 43 scores
2. Calculate the size of each criterion group (N), which is 27% of the total number of scores. Ex. N = 27% of 43 = (0.27)(43) = 11.61 ≈ 12
3. Take 12 scores from the highest down and 12 scores from the lowest up; call these the High group and the Low group respectively.
4. Tabulate the number of responses for each option from the High and Low groups for the particular item under analysis.
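Steps 1 to 3 of the example can be sketched in a few lines. This is an illustration only: the function name `criterion_groups` and the scores are hypothetical, and it assumes the group size is rounded to the nearest whole examinee, as in the example where 11.61 becomes 12.

```python
def criterion_groups(scores, fraction=0.27):
    """Return (high_group, low_group): the top and bottom `fraction` of the ranked scores."""
    ranked = sorted(scores, reverse=True)  # step 1: highest to lowest
    size = round(fraction * len(ranked))   # step 2: size of each criterion group
    if size < 1:
        raise ValueError("too few scores to form criterion groups")
    high_group = ranked[:size]             # step 3: take scores from the highest down
    low_group = ranked[-size:]             # step 3: take scores from the lowest up
    return high_group, low_group


# Hypothetical class of n = 43 examinees with scores 1..43
scores = list(range(1, 44))
high, low = criterion_groups(scores)
print(len(high), len(low))  # 12 12
```

In practice the ranked list would hold examinee records rather than bare scores, so that each examinee's item responses can be tabulated in step 4.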
5. Solve for the difficulty index of each item.
• The larger the value of the index, the easier the item. The smaller the value, the more difficult the item.
• Scale for interpreting the difficulty index of an item:
Below 0.25: item is very difficult
0.25 – 0.75: item is of average difficulty (rightly difficult)
Above 0.75: item is very easy
Ex: Item # 5 of the Multiple Choice test; D is the correct option (marked *).
            | A | B | C | D* | E | Total
Upper Group | 1 | 1 | 0 | 9  | 1 | 12
Lower Group | 3 | 1 | 4 | 4  | 0 | 12
The following can be used to interpret the index of discrimination (Idis).
Idis Index  | Description | Interpretation
0.40 – 1.0  | High        | The item is very good
0.30 – 0.39 | Moderate    | Reasonably good, can be improved
0.20 – 0.29 | Moderate    | In need of improvement
< 0.20      | Low         | Poor, to be discarded

• Interpreting the results by giving a value judgment
Idis          | Idif           | Item category
High          | Easy           | Good
High          | Easy/difficult | Fair
Moderate      | Easy/difficult | Fair
High/moderate | Easy/difficult | Fair
Low           | At any level   | Poor (discard the item)
Hc and Lc are the numbers of correct responses in the High and Low groups, and N is the size of each group.
Index of difficulty = (Hc + Lc) / 2N = (9 + 4) / (2 × 12) = 0.54: the item is rightly difficult.
Index of discrimination = (Hc - Lc) / N = (9 - 4) / 12 = 0.42: a high index of discrimination; the item has the power to discriminate.
Hence, item number 5 has to be retained.
Distracter analysis: A and C are good distracters.
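The computation for item 5 can be reproduced with the response counts from the table. This is a minimal sketch: the function names are hypothetical, and the rule used to flag a working distracter (chosen more often by the Low group than by the High group) is an assumption that happens to agree with the conclusion on the slide.

```python
def difficulty_index(high_correct, low_correct, group_size):
    """(Hc + Lc) / 2N: proportion of both criterion groups answering correctly."""
    return (high_correct + low_correct) / (2 * group_size)


def discrimination_index(high_correct, low_correct, group_size):
    """(Hc - Lc) / N: how much better the High group did than the Low group."""
    return (high_correct - low_correct) / group_size


# Response counts for item 5; D is the keyed (correct) option
upper = {"A": 1, "B": 1, "C": 0, "D": 9, "E": 1}
lower = {"A": 3, "B": 1, "C": 4, "D": 4, "E": 0}
key = "D"
group_size = 12

print(round(difficulty_index(upper[key], lower[key], group_size), 2))      # 0.54
print(round(discrimination_index(upper[key], lower[key], group_size), 2))  # 0.42

# Assumed rule: a distracter works if it attracts more Low-group than High-group examinees
working = [opt for opt in upper if opt != key and lower[opt] > upper[opt]]
print(working)  # ['A', 'C']
```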