
Evaluation of the reliability for L2 speech rating in discourse completion test (Methoken in Seoul)



Slides from the 2015 Joint International Methodology Research Colloquium.

Published in: Education


  1. Evaluation of the reliability for L2 speech rating in discourse completion test (Yusuke Kondo and Yutaka Ishii)
  2. Prediction method used in automated scoring system for L2
     • Predictors: speech rate, pitch range, mean length of utterance
     [Figure: values 0 and 1; label "Item x"]
  3. Predictor examination
     • When we try to predict scores using two indices …,
     [Figure: values 0 and 1 plotted against Index A, Index B, Index C, and Index D; labels "Good predictors" and "Bad predictors"]
  4. Unreliable rating
     [Figure: values 0 and 1 plotted against Index A and Index B; panels "The first rating" and "The second rating"]
  5. Ishii and Kondo (2015)
     [Figure: values .27 and .57; labels "Our own ratings" and "Ratings in Narita (2013)"]
  6. Agreement of automated scoring with raters

     | Group | Correlation | % Exact Agreement | % Adjacent Agreement | Kappa | Weighted Kappa |
     |---|---|---|---|---|---|
     | Naïve | .77 | 41 | 89 | .27 | .75 |
     | Untrained | .61 | 31 | 73 | .16 | .59 |
     | Certificated (Average) | .92 | 70 | 99 | .62 | .91 |
     | Certificated (Exemplary) | .95 | 80 | 100 | .76 | .94 |

     Source: Powers, Escoffery, and Duchnowski (2015), Applied Measurement in Education
     Untrained < Naïve < Certificated (Average) < Certificated (Exemplary)
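
The agreement indices reported on slide 6 can be computed for any pair of raters. The following is a minimal Python sketch, not the procedure used by Powers et al.: it assumes integer scores, counts "adjacent agreement" as scores within one point of each other (so it includes exact matches), and uses quadratic weights for weighted kappa, which the slides do not specify. The function name and the ratings are hypothetical.

```python
from collections import Counter

def agreement_indices(a, b, categories):
    """Exact/adjacent agreement, Cohen's kappa, and quadratic weighted kappa
    for two raters. Undefined (division by zero) if chance agreement is 1."""
    n = len(a)
    exact = sum(x == y for x, y in zip(a, b)) / n
    # Assumption: "adjacent" means the two scores differ by at most one point.
    adjacent = sum(abs(x - y) <= 1 for x, y in zip(a, b)) / n

    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from the two raters' marginal distributions.
    p_chance = sum(count_a[c] * count_b[c] for c in categories) / n**2
    kappa = (exact - p_chance) / (1 - p_chance)

    # Quadratic weights: disagreement grows with the squared score distance.
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / n
    expected = sum(
        count_a[i] * count_b[j] * (i - j) ** 2
        for i in categories for j in categories
    ) / n**2
    weighted_kappa = 1 - observed / expected
    return exact, adjacent, kappa, weighted_kappa

# Hypothetical ratings on a 0-3 scale.
rater_1 = [3, 2, 2, 1, 3, 2, 1, 0]
rater_2 = [3, 2, 1, 1, 2, 2, 3, 0]
print(agreement_indices(rater_1, rater_2, categories=[0, 1, 2, 3]))
```

Weighted kappa credits near misses, which is one reason it can sit well above unweighted kappa for the same pair of raters.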
  7. Comes as no surprise
     • Reliable rating is absolutely essential for the construction of an automated scoring system.
  8. Then,
     • How do we evaluate reliability in L2 performance?
     • What index should be used?
  9. Outline
     • Reliability indices in L2 performance assessment
     • Reliability indices in psychometrics
     • Observation of reliability indices
     • Some comments and suggestions
  10. Language Testing 30-32
      • Reliability indices used
        1. Cronbach’s alpha
        2. Percentage of agreements
        3. Cohen’s kappa
        4. Spearman rank correlation coefficient
        5. Pearson correlation coefficient
        6. Infit and Outfit measures (IRT)
        7. Root-mean-square deviation
  11. Alpha in rating data
      • Bachman (2004): “coefficient alpha should be used”
      • Bachman’s recommendation is introduced in Carr (2011) and Sawaki (2013).
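
When coefficient alpha is applied to rating data, the raters play the role of test items. The following is a minimal Python sketch of the usual variance-based formula, assuming a complete utterances-by-raters matrix with no missing scores; the function name and the ratings are hypothetical, and this is not necessarily the computation behind the slides.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Coefficient alpha for a list of rows (one row per utterance,
    one column per rater)."""
    k = len(ratings[0])                                   # number of raters
    rater_variances = [variance(col) for col in zip(*ratings)]
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)

# Hypothetical scores from three raters on five utterances (0-3 scale).
ratings = [
    [3, 2, 3],
    [2, 2, 1],
    [1, 1, 2],
    [3, 3, 2],
    [0, 1, 1],
]
print(round(cronbach_alpha(ratings), 2))
```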
  12. Journals on psychometrics
      • Reliability indices discussed
        1. Polychoric correlation coefficient
        2. McDonald’s omega
        3. Intraclass correlation coefficient
        4. Standard deviation of correlation coefficients
        5. Means of correlation coefficients
  13. Next,
      • we will look at how the reliability indices behave in our rating data.
  14. Data
      • 30 different discourse completion tasks completed by 44-60 university students.
      • Each utterance was rated by three different raters.
  15. Example
      When you (A) want to ask your friend about their weekend, what would you say in the conversation below?
      A: ( )
      B: We went shopping.
  16. Rating criteria

      | Score | Description |
      |---|---|
      | 3 | Can understand the speaker’s intention. Natural pronunciation and intonation. Almost no foreign accentedness. |
      | 2 | Can understand the speaker’s intention, but can find some foreign accents. |
      | 1 | Can’t understand the speaker’s intention because of strong foreign accents. |
      | 0 | Can’t catch the utterance because of low voice or noise. |
  17. Target indices
      • Cronbach’s alpha
        – Kendall
        – Spearman
        – Pearson
        – Polychoric
      • McDonald’s omega
      • Mean of correlation coefficients
      • Fleiss’ kappa
      • Percentage of exact and adjacent agreement
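
Several of the target indices start from pairwise correlations among raters. The following is a minimal Python sketch of the mean of correlation coefficients, computed with Pearson's and Spearman's coefficients only; Kendall's and polychoric coefficients are omitted for brevity. All function names and ratings are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_pairwise(raters, corr):
    """Mean of the correlation coefficients over all pairs of raters."""
    pairs = list(combinations(raters, 2))
    return sum(corr(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores from three raters on six utterances (0-3 scale).
raters = [
    [3, 2, 2, 1, 3, 0],
    [3, 2, 1, 1, 2, 0],
    [2, 2, 1, 1, 3, 1],
]
print(mean_pairwise(raters, pearson), mean_pairwise(raters, spearman))
```

Ties are frequent on a 4-point scale, so the rank step averages tied ranks rather than breaking ties arbitrarily.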
  18. Data frame

      | | α_k | α_spe | α_pea | α_pol | … | κ | % |
      |---|---|---|---|---|---|---|---|
      | Item 1 | .47 | .53 | .48 | .74 | … | .22 | .75 |
      | Item 2 | .56 | .55 | .55 | .67 | … | .25 | .80 |
      | Item 3 | .62 | .67 | .64 | .59 | … | .30 | .90 |
      | … | … | … | … | … | … | … | … |
      | Item 30 | .66 | .86 | .67 | .92 | … | .47 | .66 |
  19. Much the same: mean of correlation coefficients, Cronbach’s alpha, McDonald’s omega
  20. Correlations among coefficients
      • Cronbach’s alpha [Figure: scatterplot matrix of alpha_ken, alpha_spe, alpha_pea, alpha_pol; correlations shown: 0.99, 0.91, 0.79, 0.93, 0.81, 0.81]
      • Mean of correlation coefficients [Figure: scatterplot matrix of m_ken, m_spe, m_pea, m_pol; correlations shown: 1.00, 0.92, 0.74, 0.94, 0.76, 0.78]
  21. Correlations among coefficients
      • McDonald’s omega [Figure: scatterplot matrix of omegah_ken, omegah_spe, omegah_pea, omegah_pol; correlations shown: 0.97, 0.86, 0.69, 0.91, 0.73, 0.67]
  22. Comment
      • Spearman’s and Pearson’s coefficients yield much the same results on a 4-point scale.
  23. Suggestion
      • Polychoric correlation coefficients should be used if you would prefer not to violate statistical constraints and/or underestimate the reliability of your data.
  24. Reason
      • Pearson’s should not be used for rating data.
      • Use Spearman’s instead.
      • But their correlation is extremely high.
      • They might share the same construct.
  25. Correlation among indices
      • Kendall’s-based indices [Figure: scatterplot matrix of m_ken, alpha_ken, omegah_ken; correlations shown: 0.99, 0.97, 0.97]
      • Spearman’s-based indices [Figure: scatterplot matrix of m_spe, alpha_spe, omegah_spe; correlations shown: 0.99, 0.96, 0.97]
  26. Correlation among indices
      • Pearson’s-based indices [Figure: scatterplot matrix of m_pea, alpha_pea, omegah_pea; correlations shown: 0.99, 0.95, 0.95]
      • Polychoric-based indices [Figure: scatterplot matrix of alpha_pol, omegah_pol, m_pol; correlations shown: 0.94, 0.98, 0.88]
  27. Suggestion
      • You can use any of the three: mean of correlation coefficients, Cronbach’s alpha, or McDonald’s omega.
  28. ICC, Kappa, and %

      | | α | M of r | ω | ICC | κ | % |
      |---|---|---|---|---|---|---|
      | α | 1 | .98 | .94 | .75 | .54 | .53 |
      | M of r | .98 | 1 | .88 | .72 | .54 | .44 |
      | ω | .94 | .88 | 1 | .74 | .48 | .58 |
      | ICC | .75 | .72 | .74 | 1 | .81 | .72 |
      | κ | .54 | .54 | .48 | .81 | 1 | .61 |
      | % | .53 | .44 | .58 | .72 | .61 | 1 |

      α: α using polychoric correlation coefficients
      M of r: Mean of polychoric correlation coefficients
      ω: ω using polychoric correlation coefficients
      ICC: Intraclass correlation coefficients
      κ: Fleiss’ kappa
      %: Percentage of exact and adjacent agreements
  29. Comment
      • “Agreement” may be a construct different from “reliability.”
      [Figure: diagram with labels "Rater A", "Rater B", "True score", and "Agreement"]
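
One way to see how agreement and correlation-based reliability can diverge is a rater who is consistently one point harsher than another. The following is a minimal Python sketch with hypothetical ratings: the two raters rank the utterances identically, yet never assign the same score.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings: rater B always scores one point lower than rater A.
rater_a = [3, 2, 3, 1, 2, 3, 1, 2]
rater_b = [score - 1 for score in rater_a]

n = len(rater_a)
r = pearson(rater_a, rater_b)
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n

# Correlation is perfect, exact agreement is zero, and every pair is adjacent.
print(round(r, 2), exact, adjacent)
```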
  30. One more thing we have found:
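  31. A feature of alpha

      Table 1: Item A (α = .92)

      | | A | B | C | D | E |
      |---|---|---|---|---|---|
      | A | 1 | | | | |
      | B | .7 | 1 | | | |
      | C | .7 | .7 | 1 | | |
      | D | .7 | .7 | .7 | 1 | |
      | E | .7 | .7 | .7 | .7 | 1 |

      Table 2: Item B (α = .92)

      | | F | G | H | I | J |
      |---|---|---|---|---|---|
      | F | 1 | | | | |
      | G | .9 | 1 | | | |
      | H | .9 | .9 | 1 | | |
      | I | .5 | .5 | .5 | 1 | |
      | J | .6 | .6 | .6 | .9 | 1 |

      The tables were created based on Schmitt (1996), Psychological Assessment.
      To show the difference, it is recommended to report the SD of the correlation coefficients.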
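
The point of slide 31 can be checked numerically. The following is a minimal Python sketch, assuming alpha is computed in its standardized form from the mean inter-rater correlation (the slides do not say which form was used, but this form reproduces the reported .92 for both tables). The function name is hypothetical; the correlations are copied from Tables 1 and 2.

```python
from statistics import mean, stdev

def standardized_alpha(k, correlations):
    """Standardized alpha from the k*(k-1)/2 pairwise correlations
    among k raters (assumes equal rater variances)."""
    r_bar = mean(correlations)
    return k * r_bar / (1 + (k - 1) * r_bar)

# Lower-triangle correlations from Table 1 (Item A) and Table 2 (Item B).
item_a = [0.7] * 10
item_b = [0.9, 0.9, 0.9, 0.5, 0.5, 0.5, 0.6, 0.6, 0.6, 0.9]

for name, corrs in [("Item A", item_a), ("Item B", item_b)]:
    alpha = standardized_alpha(5, corrs)
    print(name, round(alpha, 2), round(stdev(corrs), 2))
```

Both items round to the same alpha, while the SD of the correlations is zero for Item A and clearly above zero for Item B, which is the difference the slide recommends reporting.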
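  32. In our data

      | | K | L | M |
      |---|---|---|---|
      | K | 1 | | |
      | L | .80 | 1 | |
      | M | .45 | .90 | 1 |

      | | N | O | P |
      |---|---|---|---|
      | N | 1 | | |
      | O | .95 | 1 | |
      | P | .92 | .76 | 1 |

      [Figure: plot with axes "Alpha" and "SD"]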
  33. Comments
      • Even if we obtain much the same alphas for two items, the correlations among raters can differ between them.
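  34. Another feature of alpha

      Three raters, all correlations .7 (α = .87)

      | | Q | R | S |
      |---|---|---|---|
      | Q | 1 | | |
      | R | .7 | 1 | |
      | S | .7 | .7 | 1 |

      Six raters, all correlations .7 (α = .93)

      | | T | U | V | X | Y | Z |
      |---|---|---|---|---|---|---|
      | T | 1 | | | | | |
      | U | .7 | 1 | | | | |
      | V | .7 | .7 | 1 | | | |
      | X | .7 | .7 | .7 | 1 | | |
      | Y | .7 | .7 | .7 | .7 | 1 | |
      | Z | .7 | .7 | .7 | .7 | .7 | 1 |

      Six raters, all correlations .5 (α = .86)

      | | a | b | c | d | e | f |
      |---|---|---|---|---|---|---|
      | a | 1 | | | | | |
      | b | .5 | 1 | | | | |
      | c | .5 | .5 | 1 | | | |
      | d | .5 | .5 | .5 | 1 | | |
      | e | .5 | .5 | .5 | .5 | 1 | |
      | f | .5 | .5 | .5 | .5 | .5 | 1 |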
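
The three alphas on slide 34 follow from the standardized form of alpha, which depends only on the number of raters k and the mean inter-rater correlation. The worked arithmetic below assumes that form (the slides do not state it); the results match the reported values up to rounding.

```latex
\alpha_{\mathrm{std}} = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}

\begin{align*}
k = 3,\ \bar{r} = .7 &: \quad \frac{3(.7)}{1 + 2(.7)} = \frac{2.1}{2.4} = .875 \\
k = 6,\ \bar{r} = .7 &: \quad \frac{6(.7)}{1 + 5(.7)} = \frac{4.2}{4.5} \approx .933 \\
k = 6,\ \bar{r} = .5 &: \quad \frac{6(.5)}{1 + 5(.5)} = \frac{3.0}{3.5} \approx .857
\end{align*}
```

Doubling the number of raters at the same correlation raises alpha, and six raters correlating at only .5 reach nearly the same alpha as three raters correlating at .7.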
  35. Final suggestions
      • When you report the reliability of rating data with more than two raters,
        – Polychoric correlation coefficients should be used.
        – The SD of the correlation coefficients among raters is recommended to be reported.
        – The mean of the correlation coefficients might be used instead of alpha, as it might be more comprehensible than alpha.
  36. Outline
      • Reliability indices in L2 performance assessment
      • Reliability indices in psychometrics
      • Observation of reliability indices
      • Some comments and suggestions
