Methods of measuring test reliability
RELIABILITY:
Reliability in statistics and psychometrics is the overall consistency of a measure. A measure
is said to have a high reliability if it produces similar results under consistent conditions. For
example, measurements of people's height and weight are often extremely reliable.
Reliability is the degree to which an assessment tool produces stable and consistent results.
So we can say that reliability is a measure of the consistency of a test.
Here are some of the most common ways of measuring reliability for any empirical
method or metric:
Inter-rater reliability
Test-retest reliability
Parallel forms reliability
Internal consistency reliability
Split-half reliability
METHODS OF MEASURING RELIABILITY
1. Inter-Rater or Inter-Observer Reliability
This is used to assess the degree to which different raters or observers give consistent
estimates of the same phenomenon.
Whenever you use humans as a part of your measurement procedure, you have to worry
about whether the results you get are reliable or consistent. People are notorious for their
inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We
daydream. We misinterpret.
For example, we found that the average inter-rater reliability of usability experts rating the severity
of usability problems was r = .52. You can also measure intra-rater reliability, whereby you correlate
multiple scores from one observer. In that same study, we found that the average intra-rater reliability
when judging problem severity was r = .58 (which is generally considered low reliability).
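As an illustration, inter-rater reliability of the kind quoted above can be estimated by correlating two raters' scores on the same items. The sketch below uses invented ratings (not the study data) and a small Pearson-correlation helper written here to keep the example self-contained:

```python
# A minimal sketch: inter-rater reliability as the Pearson correlation
# between two raters' severity scores. The ratings are invented
# illustrative data, not the figures quoted in the text.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two raters score the severity of the same ten usability problems (1-5 scale).
rater_a = [3, 4, 2, 5, 1, 4, 3, 2, 5, 4]
rater_b = [3, 3, 2, 5, 2, 4, 4, 2, 4, 4]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater reliability r = {r:.2f}")
```

A real analysis of categorical severity judgments might instead use an agreement statistic such as Cohen's kappa, but a simple correlation matches the r values reported above.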
2. Test-Retest Reliability
We estimate test-retest reliability when we administer the same test to the same sample on two
different occasions.
The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation.
This is because the two observations are related over time: the more time passes between them, the more opportunity there is for real change, and the weaker the relationship becomes.
Do customers provide the same set of responses when nothing about their experience or their attitudes
has changed? You don't want your measurement system to fluctuate when all other things are static.
Have a set of participants answer a set of questions (or perform a set of tasks). Later (by at least a few
days, typically), have them answer the same questions again. When you correlate the two sets of
measures, look for very high correlations (r > 0.7) to establish retest reliability.
As you can see, there's some effort and planning involved: you need participants to agree to
answer the same questions twice. Few questionnaires measure test-retest reliability (mostly because of
the logistics), but with the proliferation of online research, we should encourage more of this type of
measure.
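The retest procedure above can be sketched in a few lines. The two sets of scores below are hypothetical, and the check mirrors the r > 0.7 guideline from the text:

```python
# Test-retest sketch: the same (hypothetical) participants answer the same
# satisfaction question on two occasions, and the two score sets are
# correlated against the r > 0.7 retest threshold mentioned in the text.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Scores (1-7 scale) from the same eight participants, about a week apart.
time_1 = [6, 5, 7, 4, 6, 3, 5, 6]
time_2 = [6, 4, 7, 4, 5, 3, 5, 7]

r = pearson_r(time_1, time_2)
print(f"test-retest r = {r:.2f}, meets r > 0.7 threshold: {r > 0.7}")
```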
3. Parallel Forms Reliability
Getting the same or very similar results from slight variations on the question or evaluation method
also establishes reliability. One way to achieve this is to create 20 items that measure one construct,
split them into two 10-item forms, administer both forms to the same sample, and then correlate
the results. You're looking for a high correlation and no systematic difference in scores between the
two forms.
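Under the common design in which the same participants complete both forms, the analysis reduces to correlating the per-person form totals. The sketch below uses a shortened six-item pool with invented responses:

```python
# Parallel-forms sketch (invented data): the same participants answer two
# short forms built from one item pool, and the per-person form totals are
# correlated. A real study would use longer forms, e.g. 10 items each.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((sum((a - mx) ** 2 for a in x) ** 0.5) *
                  (sum((b - my) ** 2 for b in y) ** 0.5))

# Rows = participants; first three columns = form A, last three = form B.
responses = [
    [5, 4, 5, 4, 5, 4],
    [2, 3, 2, 3, 2, 2],
    [4, 4, 4, 5, 4, 4],
    [1, 2, 1, 1, 2, 1],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 5, 4, 5],
]
form_a = [sum(row[:3]) for row in responses]  # total on form A per person
form_b = [sum(row[3:]) for row in responses]  # total on form B per person

print(f"parallel forms r = {pearson_r(form_a, form_b):.2f}")
```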
4. Internal Consistency Reliability
It measures how consistently participants respond to one set of items. This is by far the most
commonly used measure of reliability in applied settings. It's popular because it's the easiest to
compute using software—it requires only one sample of data to estimate the internal consistency
reliability. This measure of reliability is described most often using Cronbach's alpha (sometimes
called coefficient alpha).
The more items you have, the more internally reliable the instrument, so to increase internal
consistency reliability, you would add items to your questionnaire. Since there's often a strong need to
have few items, however, internal reliability usually suffers. When you have only a few items, and
therefore usually lower internal reliability, having a larger sample size helps offset the loss in
reliability.
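Cronbach's alpha, mentioned above, can be computed from the standard variance formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A minimal sketch with invented item scores:

```python
# Cronbach's alpha from the variance formula
#   alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
# on invented data: rows = participants, columns = k items.

def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 4, 3],
    [1, 2, 1, 2],
]
k = len(scores[0])
items = list(zip(*scores))             # transpose to item columns
totals = [sum(row) for row in scores]  # one total score per participant

alpha = (k / (k - 1)) * (1 - sum(variance(col) for col in items) / variance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")
```

In practice a statistics package would be used, but the arithmetic is exactly this.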
A. Average inter-item correlation is a subtype of internal consistency
reliability. It is obtained by taking all of the items on a test that probe the
same construct (e.g., reading comprehension), determining the correlation
coefficient for each pair of items, and finally taking the average of all of these
correlation coefficients. This final step yields the average inter-item
correlation.
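The procedure in (A) can be sketched directly: correlate every pair of item columns and average the coefficients. The scores below are invented for illustration:

```python
# Average inter-item correlation on invented data: correlate every pair
# of item columns that probe one construct, then average the coefficients.
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((sum((a - mx) ** 2 for a in x) ** 0.5) *
                  (sum((b - my) ** 2 for b in y) ** 0.5))

# Rows = participants; columns = four reading-comprehension items (0-10).
scores = [
    [7, 6, 8, 7],
    [4, 5, 4, 3],
    [9, 8, 9, 10],
    [5, 5, 6, 5],
    [2, 3, 2, 2],
    [8, 7, 7, 8],
]
items = list(zip(*scores))                       # transpose to item columns
pairs = list(combinations(range(len(items)), 2)) # every pair of items
avg_r = sum(pearson_r(items[i], items[j]) for i, j in pairs) / len(pairs)
print(f"average inter-item correlation = {avg_r:.2f}")
```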
B. Split-half reliability is another subtype of internal consistency
reliability. The process of obtaining split-half reliability is begun by “splitting
in half” all items of a test that are intended to probe the same area of
knowledge (e.g., World War II) in order to form two “sets” of
items. The entire test is administered to a group of individuals, the total score
for each “set” is computed, and finally the split-half reliability is obtained by
determining the correlation between the two total “set” scores.
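The split-half procedure in (B) might look like the sketch below, using invented scores and an odd/even split (one common choice). It also applies the Spearman-Brown step-up, a standard companion step not described above, which projects the half-length correlation to full test length:

```python
# Split-half sketch on invented scores: split items into odd- and
# even-positioned halves, total each half per participant, correlate the
# totals, then apply the Spearman-Brown correction to estimate the
# reliability of the full-length test.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((sum((a - mx) ** 2 for a in x) ** 0.5) *
                  (sum((b - my) ** 2 for b in y) ** 0.5))

scores = [  # rows = participants, columns = six items on one topic
    [5, 4, 5, 4, 4, 5],
    [2, 3, 2, 2, 3, 2],
    [4, 4, 5, 4, 4, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 3, 4, 3, 3, 4],
    [5, 5, 5, 4, 5, 5],
]
half_1 = [sum(row[0::2]) for row in scores]  # items 1, 3, 5
half_2 = [sum(row[1::2]) for row in scores]  # items 2, 4, 6

r_half = pearson_r(half_1, half_2)
split_half = 2 * r_half / (1 + r_half)       # Spearman-Brown step-up
print(f"half-test r = {r_half:.2f}, split-half reliability = {split_half:.2f}")
```

The correction is needed because each half is only half as long as the real test, and shorter tests are less reliable, as the paragraph on item count noted.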
CRONBACH’S ALPHA METHOD
Internal consistency is usually measured with Cronbach's alpha, a statistic calculated from the
pairwise correlations between items. Internal consistency ranges between negative infinity and
one. Coefficient alpha will be negative whenever there is greater within-subject variability than
between-subject variability. A commonly accepted rule of thumb for describing internal
consistency is as follows: