Reliability in Behavioural Research
Dr. Sanchita Garai, Mr. Amitava Panja & Dr. Sanjit Maiti
Context of ‘reliability’ and ‘validity’
• Quantification of human behaviour: Measuring instruments to observe human
behaviour
• Two important aspects of measuring instruments:
a. Validity
b. Reliability
Source: Mohajan (2017)
What is Reliability?
Synonyms for reliability: Dependability, Stability, Consistency, Accuracy
Three approaches for defining reliability:
a. Stability approach:
“If we measure the same set of objects again and again with the same or comparable
measuring instrument, will we get the same or similar results?”
b. Accuracy approach:
“Are the measures obtained from a measuring instrument the ‘true’ measures of the property
measured?”
c. Solving theoretical or practical problems:
“How much error of measurement is there in a measuring instrument?”
Definition: Reliability is the accuracy or precision of a measuring instrument.
Defining ‘Reliability’
• Reliability is the accuracy or precision of a measuring instrument. (Kerlinger, 1964)
• Reliability is the agreement between two efforts to measure the same trait through
maximally similar methods. (Campbell & Fiske, 1967)
• Reliability is a major concern when a psychological test is used to measure some
attribute or behaviour. (Rosenthal and Rosnow, 1991)
• Reliability is consistency of measurement (Bollen, 1989), or stability of measurement
over a variety of conditions in which basically the same results should be obtained
(Nunnally, 1978).
Estimates of Reliability
Methods of estimating reliability can be roughly categorized into two groups:
a. Two separate test administrations
b. One test administration
• Typical methods to estimate test reliability in behavioural research are:
a. Test-retest reliability
b. Alternative forms
c. Split-halves
d. Inter-rater reliability
e. Internal consistency
• There are three main concerns in reliability testing: equivalence, stability over
time, and internal consistency.
[Figure: Estimates of reliability. Source: Drost (2011)]
Test-Retest Reliability
• Refers to the temporal stability of a test from one measurement session to another.
• Administer the test to a group of respondents and then administer the same test to the same
respondents at a later date. The correlation between scores on the identical tests given at different
times operationally defines its test-retest reliability.
• Despite its appeal, the test-retest reliability technique has several limitations (Rosenthal & Rosnow,
1991).
a. Memory effect: When the interval between the first and second test is too short, respondents
might remember what was on the first test and their answers on the second test could be affected by
memory.
b. Maturation effect: Happens when the interval between the two tests is too long.
Maturation refers to changes in the subject factors or respondents (other than those associated
with the independent variable) that occur over time and cause a change from the initial
measurements to the later measurements (t and t + 1).
[Figure: Test-retest example]
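As a concrete illustration, here is a minimal sketch of the computation (the scores are invented): the test-retest coefficient is simply the Pearson correlation between the two administrations.

```python
import numpy as np

# Invented scores for 10 respondents on the same test at time t and time t+1
scores_t1 = np.array([12, 15, 11, 18, 14, 16, 13, 17, 10, 15])
scores_t2 = np.array([13, 14, 12, 17, 15, 16, 12, 18, 11, 14])

# Test-retest reliability: Pearson correlation between the two administrations
r_test_retest = np.corrcoef(scores_t1, scores_t2)[0, 1]
print(f"Test-retest reliability: {r_test_retest:.2f}")
```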
Alternative forms of reliability
• This method involves constructing two similar forms of a test/scale (i.e., both forms have the
same content) and administering both forms to the same group of examinees within a very
short time period.
• The correlation between observed scores on the alternate test/scale forms is an estimate of
the reliability of either one of the alternate forms.
• This correlation coefficient is known as coefficient of equivalence.
• If the correlation between the alternative forms is low, it could indicate that considerable
measurement error is present, because two different scales were used.
• Several of the limitations of the test-retest method also apply to the alternative forms technique.
[Figure: Alternative forms]
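The coefficient of equivalence is computed the same way, by correlating scores on the two forms; a minimal sketch with invented scores:

```python
import numpy as np

# Invented scores of the same 10 examinees on two alternate forms
form_a = np.array([22, 30, 25, 28, 19, 27, 24, 31, 20, 26])
form_b = np.array([24, 29, 23, 27, 21, 28, 25, 30, 19, 27])

# Coefficient of equivalence: correlation between observed scores on the two forms
coef_equivalence = np.corrcoef(form_a, form_b)[0, 1]
print(f"Coefficient of equivalence: {coef_equivalence:.2f}")
```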
Test-retest vs. alternate forms reliability

| Feature | Test-Retest Reliability | Alternate Forms Reliability |
| --- | --- | --- |
| Test version | The same test is administered twice. | Two different versions (called alternate forms) of the test are created. Ideally, these forms measure the same skill or knowledge. |
| Administration | The test is administered twice to the same group of people. The time gap between the two administrations can vary depending on the type of test. | Each form of the test is administered to the same group of people, ideally at the same time. |
| Time gap | Measured between the two administrations of the same test. | Not applicable. Both forms are administered ideally at the same time. |
| Measures consistency of | Scores on the same test over time. This can be affected by memory, practice effects, or changes in the participants’ knowledge or skills. | Scores on two different but equivalent versions of the test. This helps assess whether the test itself is consistent across different versions. |
Split-half approach
• Under this method, test/scale developers divide the scale/test into two halves, so that the
first half forms the first part of the entire test/scale and the second half forms the
remaining part of the test/scale.
• Both halves are normally of equal lengths and they are designed in such a way that each
is an alternate form of the other.
• Estimation of reliability is based on correlating the results of the two halves of the
same test/scale.
• In contrast to the test-retest and alternative form methods, the split-half approach is
usually measured in the same time period. The correlation between the two half-tests must
be corrected to obtain the reliability coefficient for the whole test (Nunnally, 1978; Bollen, 1989).
[Figure: Split-half approach]
Advantages and disadvantages of the split-half approach
Advantages:
• First, the effect of memory discussed previously does not operate with this approach.
• Also, a practical advantage is that the split-halves are usually cheaper and more easily obtained
than over time data (Bollen, 1989).
Disadvantages:
• The two halves must be parallel measures; that is, the correlation between the two halves will vary slightly
depending on how the items are divided.
• Nunnally (1978) suggests using the split-half method when measuring variability of behaviours over
short periods of time when alternative forms are not available.
• For example, the even items can first be given as a test and, subsequently, on the second occasion, the odd
items as the alternative form. The corrected correlation coefficient between the even and odd item test scores
will indicate the relative stability of the behaviour over that period of time (see the sketch below).
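A minimal sketch of the even-odd split and the Spearman-Brown correction, using a randomly generated stand-in response matrix (single administration; the halves are the odd- and even-numbered items):

```python
import numpy as np

# Invented response matrix: 10 respondents x 8 items that share a common
# "true score", so the two halves should correlate positively
rng = np.random.default_rng(0)
true_score = rng.normal(3.0, 1.0, size=(10, 1))
items = np.clip(np.round(true_score + rng.normal(0, 0.7, size=(10, 8))), 1, 5)

# Even-odd split: total score on the odd-numbered vs even-numbered items
odd_total = items[:, 0::2].sum(axis=1)
even_total = items[:, 1::2].sum(axis=1)

# Correlation between the two half-tests
r_half = np.corrcoef(odd_total, even_total)[0, 1]

# Spearman-Brown correction: reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(f"Half-test correlation: {r_half:.2f}, corrected reliability: {r_full:.2f}")
```

The correction r_full = 2r / (1 + r) is the two-half special case of the Spearman-Brown formula; without it, the half-test correlation understates the reliability of the full-length test.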
Inter-rater reliability
When raters or judges are used to measure behaviour, the reliability of their judgments
or combined internal consistency of judgments is assessed (Rosenthal & Rosnow, 1991).
• Example: Two judges rating 10 persons on a particular test (e.g., judges rating people’s
competency in their writing skills).
• The correlation between the ratings made by the two judges will tell us the reliability of either
judge in the specific situation.
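A minimal sketch of the two-judge example, with invented ratings for the 10 persons:

```python
import numpy as np

# Invented ratings of 10 persons' writing skills by two judges (1-10 scale)
judge_1 = np.array([7, 5, 8, 6, 9, 4, 7, 8, 5, 6])
judge_2 = np.array([6, 5, 9, 6, 8, 5, 7, 7, 4, 6])

# Inter-rater reliability: correlation between the two judges' ratings
r_judges = np.corrcoef(judge_1, judge_2)[0, 1]
print(f"Inter-rater reliability: {r_judges:.2f}")
```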
Internal consistency reliability
• Concerns the reliability of the test components.
• Internal consistency measures consistency within the instrument and questions how well a set
of items measures a particular behaviour or characteristic within the test.
• For a test to be internally consistent, estimates of reliability are based on the average
intercorrelations among all the single items within a test.
• The most popular method of testing for internal consistency in the behavioural sciences is
coefficient alpha.
• Coefficient alpha was popularized by Cronbach (1951), who recognised its general usefulness. As
a result, it is often referred to as Cronbach’s alpha.
• Coefficients of internal consistency increase as the number of items goes up, to a certain point.
• This is the most widely used method of estimating reliability using a single test
administration.
• Cronbach’s alpha (α) is calculated with the following formula:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_x^2}\right)$$

where
k is the number of items on the test/scale,
$\sigma_i^2$ is the variance of item i, and
$\sigma_x^2$ is the total test variance.
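A direct implementation of the formula, applied to an invented response matrix:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # sigma_i^2 for each item
    total_variance = items.sum(axis=1).var(ddof=1)  # sigma_x^2 of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented 10-respondent x 5-item matrix: items share a common "true score",
# so the scale should show reasonable internal consistency
rng = np.random.default_rng(1)
true_score = rng.normal(3.0, 1.0, size=(10, 1))
items = np.clip(np.round(true_score + rng.normal(0, 0.5, size=(10, 5))), 1, 5)

print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
```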
Improvement of reliability
According to Kerlinger (1964), the principle behind improving reliability is a slightly
different form of the ‘MaxMinCon’ principle:
“Maximize the variance of the individual differences and minimize the error variance.”
• Write the items of psychological and educational measuring instruments unambiguously.
• If an instrument is not reliable enough, add more items of equal kind and quality (see the sketch below).
• Clear and standard instructions tend to reduce errors of measurement.
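The second bullet can be quantified with the Spearman-Brown prophecy formula (not stated in the source, but the standard tool for this purpose), which predicts the reliability of a test lengthened by a given factor with items of equal kind and quality:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor,
    assuming the added items are of equal kind and quality."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Example: doubling a test whose current reliability is 0.60
print(f"Predicted reliability after doubling: {spearman_brown(0.60, 2):.2f}")  # 0.75
```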