Slides from the 2015 Joint International Methodology Research Colloquium.

Published in:
Education


- 1. Evaluation of the reliability for L2 speech rating in discourse completion test, by Yusuke Kondo and Yutaka Ishii
- 2. Prediction method used in automated scoring system for L2 [Figure: items plotted on 0-1 axes. Predictors: speech rate, pitch range, mean length of utterance]
- 3. Predictor examination: when we try to predict scores using two indices … [Figure: plots of Index A, Index B, Index C, and Index D on 0-1 axes, with the labels "Good predictors" and "Bad predictors"]
- 4. Unreliable rating [Figure: Index A plotted against Index B for the first rating and for the second rating, with markers for scores of 0 and 1]
- 5. Ishii and Kondo (2015) [Figure: values .27 and .57, with the labels "Our own ratings" and "Ratings in Narita (2013)"]
- 6. Agreement of automated scoring with raters

  | Group | Correlation | % Exact Agreement | % Adjacent Agreement | Kappa | Weighted Kappa |
  |---|---|---|---|---|---|
  | Naïve | .77 | 41 | 89 | .27 | .75 |
  | Untrained | .61 | 31 | 73 | .16 | .59 |
  | Certificated (Average) | .92 | 70 | 99 | .62 | .91 |
  | Certificated (Exemplary) | .95 | 80 | 100 | .76 | .94 |

  Source: Powers, Escoffery, and Duchnowski (2015), Applied Measurement in Education. Untrained < Naïve < Certificated (Average) < Certificated (Exemplary)
- 7. Comes as no surprise: reliable rating is absolutely essential for the construction of an automated scoring system.
- 8. Then,
  - How do we evaluate reliability in L2 performance?
  - What index should be used?
- 9. Outline
  - Reliability indices in L2 performance assessment
  - Reliability indices in psychometrics
  - Observation of reliability indices
  - Some comments and suggestions
- 10. Language Testing 30-32: reliability indices used
  1. Cronbach’s alpha
  2. Percentage of agreements
  3. Cohen’s kappa
  4. Spearman rank correlation coefficient
  5. Pearson correlation coefficient
  6. Infit and Outfit measures (IRT)
  7. Root-mean-square deviation
- 11. Alpha in rating data
  - Bachman (2004): “coefficient alpha should be used”
  - Bachman’s recommendation is introduced in Carr (2011) and Sawaki (2013).
- 12. Journals on psychometrics: reliability indices discussed
  1. Polychoric correlation coefficient
  2. McDonald’s omega
  3. Intraclass correlation coefficient
  4. Standard deviation of correlation coefficients
  5. Means of correlation coefficients
- 13. Next, we will look at how the reliability indices behave in our rating data.
- 14. Data
  - 30 different discourse completion tasks, each completed by 44-60 university students.
  - Each utterance was rated by three different raters.
- 15. Example: When you (A) want to ask your friend about their weekend, what would you say in the conversation below?
  - A: ( )
  - B: We went shopping.
- 16. Rating criteria

  | Score | Description |
  |---|---|
  | 3 | Can understand the speaker’s intention. Natural pronunciation and intonation. Almost no foreign accentedness. |
  | 2 | Can understand the speaker’s intention, but can find some foreign accents. |
  | 1 | Can’t understand the speaker’s intention because of strong foreign accents. |
  | 0 | Can’t catch the utterance because of low voice or noise. |
- 17. Target indices
  - Cronbach’s alpha
    - Kendall
    - Spearman
    - Pearson
    - Polychoric
  - McDonald’s omega
  - Mean of correlation coefficients
  - Fleiss’ kappa
  - Percentage of exact and adjacent agreement
- 18. Data frame

  | | α_k | α_spe | α_pea | α_pol | … | κ | % |
  |---|---|---|---|---|---|---|---|
  | Item 1 | .47 | .53 | .48 | .74 | … | .22 | .75 |
  | Item 2 | .56 | .55 | .55 | .67 | … | .25 | .80 |
  | Item 3 | .62 | .67 | .64 | .59 | … | .30 | .90 |
  | … | … | … | … | … | … | … | … |
  | Item 30 | .66 | .86 | .67 | .92 | … | .47 | .66 |
- 19. Much the same. [Figure: comparison of the mean of correlation coefficients, Cronbach’s alpha, and McDonald’s omega]
- 20. Correlations among coefficients [Figure: two scatterplot matrices with the pairwise correlations printed in the panels]
  - Cronbach’s alpha: alpha_ken/alpha_spe 0.99, alpha_ken/alpha_pea 0.91, alpha_ken/alpha_pol 0.79, alpha_spe/alpha_pea 0.93, alpha_spe/alpha_pol 0.81, alpha_pea/alpha_pol 0.81
  - Mean of correlation coefficients: m_ken/m_spe 1.00, m_ken/m_pea 0.92, m_ken/m_pol 0.74, m_spe/m_pea 0.94, m_spe/m_pol 0.76, m_pea/m_pol 0.78
- 21. Correlations among coefficients [Figure: scatterplot matrix with the pairwise correlations printed in the panels]
  - McDonald’s omega: omegah_ken/omegah_spe 0.97, omegah_ken/omegah_pea 0.86, omegah_ken/omegah_pol 0.69, omegah_spe/omegah_pea 0.91, omegah_spe/omegah_pol 0.73, omegah_pea/omegah_pol 0.67
- 22. Comment: Spearman’s and Pearson’s coefficients give much the same results on a 4-point scale.
- 23. Suggestion: polychoric correlation coefficients should be used if you want to avoid violating statistical constraints and/or underestimating the reliability of your data.
- 24. Reason
  - Pearson’s should not be used for rating data.
  - Use Spearman’s instead.
  - But the correlation between the two is extremely high.
  - They might share their construct.
- 25. Correlation among indices [Figure: two scatterplot matrices with the pairwise correlations printed in the panels]
  - Kendall’s-based indices: m_ken/alpha_ken 0.99, m_ken/omegah_ken 0.97, alpha_ken/omegah_ken 0.97
  - Spearman’s-based indices: m_spe/alpha_spe 0.99, m_spe/omegah_spe 0.96, alpha_spe/omegah_spe 0.97
- 26. Correlation among indices [Figure: two scatterplot matrices with the pairwise correlations printed in the panels]
  - Pearson’s-based indices: m_pea/alpha_pea 0.99, m_pea/omegah_pea 0.95, alpha_pea/omegah_pea 0.95
  - Polychoric-based indices: alpha_pol/omegah_pol 0.94, alpha_pol/m_pol 0.98, omegah_pol/m_pol 0.88
- 27. Suggestion: you can use any of the mean of correlation coefficients, Cronbach’s alpha, or McDonald’s omega.
- 28. ICC, Kappa, and %

  | | α | M of r | ω | ICC | κ | % |
  |---|---|---|---|---|---|---|
  | α | 1 | .98 | .94 | .75 | .54 | .53 |
  | M of r | .98 | 1 | .88 | .72 | .54 | .44 |
  | ω | .94 | .88 | 1 | .74 | .48 | .58 |
  | ICC | .75 | .72 | .74 | 1 | .81 | .72 |
  | κ | .54 | .54 | .48 | .81 | 1 | .61 |
  | % | .53 | .44 | .58 | .72 | .61 | 1 |

  - α: α using polychoric correlation coefficients
  - M of r: mean of polychoric correlation coefficients
  - ω: ω using polychoric correlation coefficients
  - ICC: intraclass correlation coefficients
  - κ: Fleiss’ kappa
  - %: percentage of exact and adjacent agreements
- 29. Comment: “Agreement” may be a construct different from “reliability.” [Figure: diagram labeled Rater A, Rater B, True score, and Agreement]
- 30. One more thing we have found:
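- 31. A feature of alpha

  Table 1: Item A (α = .92)

  | | A | B | C | D | E |
  |---|---|---|---|---|---|
  | A | 1 | | | | |
  | B | .7 | 1 | | | |
  | C | .7 | .7 | 1 | | |
  | D | .7 | .7 | .7 | 1 | |
  | E | .7 | .7 | .7 | .7 | 1 |

  Table 2: Item B (α = .92)

  | | F | G | H | I | J |
  |---|---|---|---|---|---|
  | F | 1 | | | | |
  | G | .9 | 1 | | | |
  | H | .9 | .9 | 1 | | |
  | I | .5 | .5 | .5 | 1 | |
  | J | .6 | .6 | .6 | .9 | 1 |

  The tables were created based on Schmitt (1996), Psychological Assessment. To show the difference, reporting the SD of the correlation coefficients is recommended.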
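- 32. In our data [Figure: scatterplot of alpha and SD]

  | | K | L | M |
  |---|---|---|---|
  | K | 1 | | |
  | L | .80 | 1 | |
  | M | .45 | .90 | 1 |

  | | N | O | P |
  |---|---|---|---|
  | N | 1 | | |
  | O | .95 | 1 | |
  | P | .92 | .76 | 1 |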
- 33. Comments: even if we obtain much the same alphas, the correlations among raters can differ between the two items.
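- 34. Another feature of alpha

  Three raters, all correlations .7 (α = .87)

  | | Q | R | S |
  |---|---|---|---|
  | Q | 1 | | |
  | R | .7 | 1 | |
  | S | .7 | .7 | 1 |

  Six raters, all correlations .7 (α = .93)

  | | T | U | V | X | Y | Z |
  |---|---|---|---|---|---|---|
  | T | 1 | | | | | |
  | U | .7 | 1 | | | | |
  | V | .7 | .7 | 1 | | | |
  | X | .7 | .7 | .7 | 1 | | |
  | Y | .7 | .7 | .7 | .7 | 1 | |
  | Z | .7 | .7 | .7 | .7 | .7 | 1 |

  Six raters, all correlations .5 (α = .86)

  | | a | b | c | d | e | f |
  |---|---|---|---|---|---|---|
  | a | 1 | | | | | |
  | b | .5 | 1 | | | | |
  | c | .5 | .5 | 1 | | | |
  | d | .5 | .5 | .5 | 1 | | |
  | e | .5 | .5 | .5 | .5 | 1 | |
  | f | .5 | .5 | .5 | .5 | .5 | 1 |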
- 35. Final suggestions: when you report the reliability of rating data with more than two raters,
  - Polychoric correlation coefficients should be used.
  - Reporting the SD of the correlation coefficients among raters is recommended.
  - The mean of the correlation coefficients might be used instead of alpha, since it might be more comprehensible than alpha.
- 36. Outline
  - Reliability indices in L2 performance assessment
  - Reliability indices in psychometrics
  - Observation of reliability indices
  - Some comments and suggestions
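
The two features of alpha shown on slides 31 and 34 can be checked with a few lines of Python. This is only a sketch under stated assumptions: it assumes the slides report standardized alpha (computed from the mean inter-rater correlation), it uses the population SD because the slides do not say which SD is meant, and the function names `standardized_alpha` and `uniform` are illustrative rather than taken from the slides.

```python
from statistics import mean, pstdev


def standardized_alpha(n_raters, correlations):
    """Standardized alpha from the pairwise correlations among raters."""
    r_bar = mean(correlations)
    return n_raters * r_bar / (1 + (n_raters - 1) * r_bar)


def uniform(n_raters, r):
    """Pairwise correlations when every pair of raters correlates at r."""
    return [r] * (n_raters * (n_raters - 1) // 2)


# Slide 31, Table 1: raters A-E, every pairwise correlation is .7.
table1 = uniform(5, 0.7)
# Slide 31, Table 2: raters F-J, lower triangle read row by row.
table2 = [0.9, 0.9, 0.9, 0.5, 0.5, 0.5, 0.6, 0.6, 0.6, 0.9]

for label, rs in [("Table 1", table1), ("Table 2", table2)]:
    alpha = standardized_alpha(5, rs)
    # Population SD is an assumption; the slides do not specify the formula.
    print(f"{label}: alpha = {alpha:.2f}, SD of correlations = {pstdev(rs):.2f}")

# Slide 34: alpha depends on the number of raters as well as on the correlation.
for n_raters, r in [(3, 0.7), (6, 0.7), (6, 0.5)]:
    alpha = standardized_alpha(n_raters, uniform(n_raters, r))
    print(f"{n_raters} raters, r = {r}: alpha = {alpha:.3f}")
```

Under these assumptions both slide 31 tables give an alpha of about .92, while the SD of the correlations is 0 for Table 1 and roughly 0.18 for Table 2, which is the difference the slide recommends reporting. The second loop gives values close to the alphas on slide 34.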
