Comes as no surprise
• Reliable rating is absolutely
essential for the construction of
an automated scoring system.
7
Then,
• how do we evaluate reliability in
L2 performance?
• What index should be used?
8
Outline
• Reliability indices in L2
performance assessment
• Reliability indices in
psychometrics
• Observation of reliability indices
• Some comments and suggestions
9
Language Testing 30-32
• Reliability indices used
1. Cronbach’s Alpha
2. Percentage of agreements
3. Cohen’s kappa
4. Spearman rank correlation coefficient
5. Pearson correlation coefficient
6. Infit and Outfit measures (IRT)
7. Root-mean-square deviation
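As a rough illustration of indices 2 and 3, the sketch below computes the percentage of agreement and Cohen's kappa for two raters. The ratings are hypothetical values invented for this example (not the study's data), and the function name `cohen_kappa` is likewise only illustrative.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two raters."""
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over every score category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Hypothetical scores for eight utterances (illustration only).
rater_1 = [3, 2, 2, 1, 3, 0, 2, 3]
rater_2 = [3, 2, 1, 1, 3, 0, 2, 2]

agreement, kappa = cohen_kappa(rater_1, rater_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa comes out lower than the raw agreement because it discounts the agreement expected by chance.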
10
Alpha in rating data
• Bachman (2004): “coefficient
alpha should be used”
• Bachman’s recommendation is
introduced in Carr (2011) and
Sawaki (2013).
11
Journals on psychometrics
• Reliability indices discussed
1. Polychoric correlation coefficient
2. McDonald’s omega
3. Intraclass correlation coefficient
4. Standard deviation of correlation coefficients
5. Means of correlation coefficients
12
Next,
• we will be looking at how the
reliability indices behave in our
rating data.
13
Data
• 30 different discourse completion
tasks completed by 44-60
university students.
• Each utterance was rated by
three different raters.
14
Example
When you (A) want to ask your friend
about their weekend, what would you
say in the conversation below?
A: ( )
B: We went shopping.
15
Rating criteria
Score  Description
3      Can understand the speaker’s intention. Natural pronunciation and intonation. Almost no foreign accentedness.
2      Can understand the speaker’s intention, but can find some foreign accents.
1      Can’t understand the speaker’s intention because of strong foreign accents.
0      Can’t catch the utterance because of low voice or noise.
16
Target indices
• Cronbach’s alpha
– Kendall
– Spearman
– Pearson
– Polychoric
• McDonald’s omega
• Mean of correlation
coefficients
• Fleiss’ kappa
• Percentage of exact and
adjacent agreement
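A minimal sketch of the last index, assuming that agreement among three raters is pooled over all rater pairs (the slides do not state how it was computed). The scores are hypothetical values on the 0-3 scale, and `agreement_rates` is an illustrative name.

```python
from itertools import combinations


def agreement_rates(ratings_by_rater):
    """Return (exact, adjacent) agreement pooled over all rater pairs.

    Exact: the two scores are identical.
    Adjacent: the two scores differ by exactly one scale point.
    """
    exact = adjacent = total = 0
    for scores_a, scores_b in combinations(ratings_by_rater, 2):
        for a, b in zip(scores_a, scores_b):
            total += 1
            difference = abs(a - b)
            exact += difference == 0
            adjacent += difference == 1
    return exact / total, adjacent / total


# Hypothetical scores from three raters for six utterances.
ratings = [
    [3, 2, 2, 1, 3, 0],  # rater 1
    [3, 2, 1, 1, 2, 0],  # rater 2
    [2, 2, 1, 1, 3, 1],  # rater 3
]

exact_rate, adjacent_rate = agreement_rates(ratings)
print(f"exact = {exact_rate:.1%}, adjacent = {adjacent_rate:.1%}")
```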
17
Comment
• Spearman’s and Pearson’s
coefficients give much the same
results on a 4-point scale.
22
Suggestion
• Polychoric correlation coefficients
should be used if you want to
avoid violating statistical
assumptions and/or
underestimating the reliability of
your data.
23
Reason
• Pearson’s should not be used for
rating data.
• Use Spearman’s instead.
• However, the two coefficients
correlate extremely highly.
• They may measure the same construct.
24
A feature of alpha
Table 1: Item A (𝛼 = .92)
   A   B   C   D   E
A  1
B  .7  1
C  .7  .7  1
D  .7  .7  .7  1
E  .7  .7  .7  .7  1

Table 2: Item B (𝛼 = .92)
   F   G   H   I   J
F  1
G  .9  1
H  .9  .9  1
I  .5  .5  .5  1
J  .6  .6  .6  .9  1
The tables were created based on Schmitt (1996),
Psychological Assessment.
To show the difference, reporting the SD of the correlation
coefficients is recommended.
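The point of Table 1 and Table 2 can be checked with a short sketch. It assumes the alphas on this slide are standardized alphas computed from the mean inter-rater correlation, and it uses the sample SD; both are assumptions, since the slide does not say which variants were used. The function names are illustrative.

```python
from statistics import mean, stdev


def flatten(lower_triangle):
    """Collect the off-diagonal correlations of a lower-triangular table."""
    return [r for row in lower_triangle for r in row]


def standardized_alpha(correlations, n_raters):
    """Standardized alpha from the mean pairwise correlation."""
    r_bar = mean(correlations)
    return n_raters * r_bar / (1 + (n_raters - 1) * r_bar)


# Off-diagonal entries of Table 1 (Item A) and Table 2 (Item B), row by row.
item_a = flatten([[.7], [.7, .7], [.7, .7, .7], [.7, .7, .7, .7]])
item_b = flatten([[.9], [.9, .9], [.5, .5, .5], [.6, .6, .6, .9]])

for name, correlations in [("Item A", item_a), ("Item B", item_b)]:
    alpha = standardized_alpha(correlations, n_raters=5)
    sd = stdev(correlations)  # sample SD of the ten correlations
    print(f"{name}: alpha = {alpha:.2f}, SD of correlations = {sd:.2f}")
```

Under these assumptions both items give an alpha of about .92, while the SD is zero for Item A and clearly above zero for Item B, which is the difference alpha alone hides.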
31
In our data
K L M
K 1
L .80 1
M .45 .90 1
[Figure: SD of correlation coefficients plotted against alpha; axis labels: Alpha, SD]
N O P
N 1
O .95 1
P .92 .76 1
32
Comments
• Even when two items yield much
the same alpha, the correlations
among raters can differ between
the two items.
33
Another feature of alpha
Three raters, all correlations .7 (𝛼 = .87)
   Q   R   S
Q  1
R  .7  1
S  .7  .7  1

Six raters, all correlations .7 (𝛼 = .93)
   T   U   V   X   Y   Z
T  1
U  .7  1
V  .7  .7  1
X  .7  .7  .7  1
Y  .7  .7  .7  .7  1
Z  .7  .7  .7  .7  .7  1

Six raters, all correlations .5 (𝛼 = .86)
   a   b   c   d   e   f
a  1
b  .5  1
c  .5  .5  1
d  .5  .5  .5  1
e  .5  .5  .5  .5  1
f  .5  .5  .5  .5  .5  1
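The dependence of alpha on the number of raters can be sketched as follows, again assuming the slide reports standardized alpha, which depends only on the number of raters and their mean correlation. The function name is illustrative.

```python
def standardized_alpha(n_raters, mean_correlation):
    """Standardized alpha for n_raters sharing a mean pairwise correlation."""
    return (n_raters * mean_correlation
            / (1 + (n_raters - 1) * mean_correlation))


# (number of raters, common correlation) for the three tables above.
scenarios = [(3, 0.7), (6, 0.7), (6, 0.5)]

for n_raters, r in scenarios:
    alpha = standardized_alpha(n_raters, r)
    print(f"{n_raters} raters, r = {r}: alpha = {alpha:.3f}")
```

Six raters who correlate at only .5 reach nearly the same alpha as three raters who correlate at .7, so a similar alpha does not by itself indicate similar agreement among raters.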
34
Final suggestions
• When you report the reliability
of rating data with more than
two raters,
– Polychoric correlation coefficients should be used.
– Reporting the SD of the correlation coefficients
among raters is recommended.
– The mean of the correlation coefficients might be
used instead of alpha, since it might be more
comprehensible than alpha.
35
Outline
• Reliability indices in L2
performance assessment
• Reliability indices in
psychometrics
• Observation of reliability indices
• Some comments and suggestions
36