Subtle patterns of learner language: 13 topics for further research


A presentation of (some of) the subtle and hitherto undetected patterns in the lexicon of Norwegian language learners revealed by a Discriminant Analysis of texts in the ASK corpus.

  • Jarvis & Crossley (J&C) focused on (1). One of my goals was to take this one step further by also doing (2).
  • I assume everyone is familiar with Jarvis’ framework for methodological rigour in transfer research, and the kinds of evidence he calls for (Jarvis 2000, Jarvis & Pavlenko 2008, Jarvis 2010). (The real Type 3 evidence can only be found in the head of the language user…)
  • Average proficiency level for SO and VI is clearly lower than for the others. Proficiency correlates negatively with linguistic and cultural distance. It is likely that this also explains the results for SH and SQ (for which figures are not available). But what about NL (for which we also do not have CEFR ratings)? We would expect them to be up there with DE and EN, in which case a lower proficiency level cannot explain the pattern of overuse amongst NL learners. Could it be thematic bias?
  • The table compares the number of occurrences per topic for DE, EN, NL and SP speakers for the six topics that have the most occurrences of skal amongst NL speakers. The count of the word skal (wc), the number of texts (tc) and the ratio between them (freq) are shown for each L1 group. Thus, NL speakers produced 39 occurrences of skal in the 8 texts entitled Framtida, a ratio of 4.9 occurrences per text. This figure can be compared with that for SP speakers writing on the same topic, i.e. 2.9, which is a considerable difference. Similar comparisons can be made with EN and SP speakers for the topic Bomiljø (‘Residential environment’), and with DE and EN speakers for the topic Nyheter (‘News’). In each case the ratio of occurrences of skal per text is higher for NL speakers. In other words, even when choice of topic is held constant, the predilection of NL speakers for the word skal is still very clear. So what is the explanation?
  • Caveat: Apparently there are other mechanisms for forming the future in Dutch, Koolhoven’s statement notwithstanding. One of them is to use the present tense with future meaning. A proper contrastive analysis is required to determine the relative frequencies of the various forms.
  • Wiktionary rankings are 102 and 5 respectively, with ‘a’ used more than 11 times more often than ‘an’.

    1. 1. Subtle patterns of learner language: 13 topics for further research. Steve Pepper, 2013-09-26, ASKeladden. [Word cloud of Norwegian words, e.g. og, er, det, å, i, jeg, som, en, at, på]
    2. 2. Introduction • An application of the detection-based argument (Jarvis 2010) – Modelled on Jarvis & Crossley (2012) • Use of data mining methods to 1) automatically detect (predict) the L1 2) identify (lexical) features that serve to discriminate between L1 groups, i.e. L1 predictors • Major advantages: – Ability to recognize positive as well as negative transfer – Ability to detect very subtle patterns that might otherwise escape notice Jarvis & Crossley (2012)
    3. 3. Evidence of the third kind... • The method supplies the first two kinds of evidence “out of the box” – The focus here is therefore on supplying the third kind • Sources of type 3 evidence – the learner’s L1 performance – comparable users’ L1 performance – contrastive grammars – traditional grammars • Involves Contrastive Interlanguage Analysis (Granger 1996) – ILL2 < > NLL1 Evidence for transfer (Jarvis 2010) 1. Intergroup heterogeneity 2. Intragroup homogeneity 3. Cross-language congruity 4. Intralingual contrasts
    4. 4. L1 predictors • 55 features (i.e. words) selected using Discriminant Analysis (see box) – DA explained on Saturday at LCR 2013 • Subjected to post-hoc analysis using Tukey’s HSD – single-step multiple comparison procedure and statistical test that is used in conjunction with an ANOVA to find means that differ statistically from each other • The output is not very easy to interpret… andre, at, av, bare, barn, barna, bo, da, de, den, det, du, eller, en, enn, er, et, for, fordi, fra, han, har, hun, i, ikke, jeg, kan, liker, man, mange, med, meg, men, mennesker, mer, min, mye, norge, norsk, når, og, også, om, på, skal, som, sted, så, til, veldig, venner, vi, viktig, være, å
    5. 5. Essence represented visually as a “homogeneity table” (feature: den)

        SH EN PL DE NO RU
                       X
           Y  Y  Y  Y
        X  X  X

                     Df Sum Sq Mean Sq F value   Pr(>F)
        myData$L1     5   1790   358.1   10.11 2.65e-09 ***
        Residuals   594  21044    35.4
        ---
        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Tukey multiple comparisons of means
        95% family-wise confidence level
        Fit: aov(formula = myData[, X] ~ myData$L1)

        $`myData$L1`
                diff        lwr         upr     p adj
        en-de -1.373 -3.7796269  1.03362692 0.5781845
        no-de  0.032 -2.3746269  2.43862692 1.0000000
        pl-de -0.239 -2.6456269  2.16762692 0.9997514
        ru-de  3.186  0.7793731  5.59262692 0.0023298
        sh-de -2.434 -4.8406269 -0.02737308 0.0456381
        no-en  1.405 -1.0016269  3.81162692 0.5528485
        pl-en  1.134 -1.2726269  3.54062692 0.7583997
        ru-en  4.559  2.1523731  6.96562692 0.0000013
        sh-en -1.061 -3.4676269  1.34562692 0.8063672
        pl-no -0.271 -2.6776269  2.13562692 0.9995400
        ru-no  3.154  0.7473731  5.56062692 0.0026907
        sh-no -2.466 -4.8726269 -0.05937308 0.0409536
        ru-pl  3.425  1.0183731  5.83162692 0.0007589
        sh-pl -2.195 -4.6016269  0.21162692 0.0969624
        sh-ru -5.620 -8.0266269 -3.21337308 0.0000000

        Group means:
           sh    en    pl    de    no    ru
        2.806 3.867 5.001 5.240 5.272 8.426

        NOTE: Tukey’s HSD was performed for groups of six L1s at a time. There were six such “groups of six”:
        – DE, EN, PL and RU were always included (along with the control group NO)
        – NL, SH, SP, SO, SQ and VI were each added in turn
        – The example above shows the homogeneity table for the group of L1s that includes SH
        – Examples to follow (including the next one) contain up to six homogeneity tables at once
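How a homogeneity table can be read off the Tukey output may be easier to see in code. The following is only a minimal sketch, not the procedure actually used for the slides: it assumes that a "homogeneous group" is a run of L1s, ordered by mean, in which no pair differs at the 0.05 level (taken here to match the 95% family-wise confidence level). The function names `differs` and `homogeneous_groups` are hypothetical; the means and adjusted p-values are copied from the slide for the feature den.

```python
# Group means and Tukey-adjusted p-values, copied from the slide (feature: den).
means = {"sh": 2.806, "en": 3.867, "pl": 5.001, "de": 5.240, "no": 5.272, "ru": 8.426}
p_adj = {
    ("en", "de"): 0.5781845, ("no", "de"): 1.0000000, ("pl", "de"): 0.9997514,
    ("ru", "de"): 0.0023298, ("sh", "de"): 0.0456381, ("no", "en"): 0.5528485,
    ("pl", "en"): 0.7583997, ("ru", "en"): 0.0000013, ("sh", "en"): 0.8063672,
    ("pl", "no"): 0.9995400, ("ru", "no"): 0.0026907, ("sh", "no"): 0.0409536,
    ("ru", "pl"): 0.0007589, ("sh", "pl"): 0.0969624, ("sh", "ru"): 0.0000000,
}
ALPHA = 0.05  # assumption: threshold matching the 95% family-wise level


def differs(a, b):
    """True if the adjusted p-value for the pair (in either order) is below ALPHA."""
    p = p_adj[(a, b)] if (a, b) in p_adj else p_adj[(b, a)]
    return p < ALPHA


def homogeneous_groups(group_means):
    """Return maximal runs of L1s (sorted by mean) in which no pair differs."""
    order = sorted(group_means, key=group_means.get)
    groups = []
    for start in range(len(order)):
        end = start + 1
        # Extend the run while the next L1 differs from none of its members.
        while end < len(order) and not any(
            differs(order[end], member) for member in order[start:end]
        ):
            end += 1
        run = order[start:end]
        # Keep the run only if it is not contained in a group already found.
        if not any(set(run) <= set(g) for g in groups):
            groups.append(run)
    return groups


groups = homogeneous_groups(means)
for g in groups:
    print(" ".join(code.upper() for code in g))

# Sanity check on the ANOVA table: F = (Sum Sq / Df) between groups
# divided by (Sum Sq / Df) of the residuals.
f_value = (1790 / 5) / (21044 / 594)
print(round(f_value, 2))
```

Under these assumptions RU ends up in a group of its own, which is the pattern summarised in the slide's bullet on den. The recomputed F value differs slightly from the reported 10.11 because the sums of squares on the slide are rounded.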
    6. 6. #1 NL speakers overuse skal • Finite form of modal auxiliary skulle; used to form the future tense han skal lage middag i kveld he will make dinner tonight – Other methods: • non-past: han lager middag i kveld • construction komme til + infinitive • Recognized tendency for beginners to overuse this form – Partly due to overly simplistic explanations in teaching materials • “Futurum lager vi av skal + infinitiv” (Greftegreff 1985) • Analysis shows that skal is overused by NL, SH, SO, SQ and VI learners RU DE EN NO PL NL Y X X X X X RU DE EN NO PL SH Y X X X X X RU DE EN NO PL SO Y X X X X X RU DE EN NO PL SP X X X X X X RU DE EN NO PL SQ Y X X X X X RU DE EN NO PL VI Y X X X X X ? proficiency ? thematic bias ? transfer
    7. 7. Proficiency? • We have CEFR ratings for 7 of the 10 L1 groups (not NL, SH, SQ) – VI and SO score lowest – DE and EN score highest • For these 7 L1 groups, overuse of skal thus correlates with linguistic and/or cultural distance – VI and SO communities in Norway originated as refugees – If lower proficiency explains overuse of skal for VI and SO, chances are that it also does so for SH and SQ – But this does not explain the NL case • So could the reason for NL users’ overuse be thematic bias? [Chart: distribution of CEFR ratings (A2, A2/B1, B1, B1/B2, B2, B2/C1, C1; axis 0–100) for SO, VI, SP, RU, PL, EN and DE]
    8. 8. Thematic bias? • Some topics are more concerned with future events than others – Over half the occurrences of skal are in 6 of the 46 topics • Cf. occurrences per text (“freq”) with the topic held constant – 4.9 (NL) >> 2.9 (SP) – 1.3 (NL) >> 0.5 (EN) and 0.6 (SP) – 1.1 (NL) >> 0.7 (DE) and 0.4 (EN) • Even with the topic held constant, the tendency is clear • Thematic bias can thus be ruled out

        Topic                                  DE              EN              NL              SP
                                           wc  tc freq     wc  tc freq     wc  tc freq     wc  tc freq
        Framtida                            -   -    -      -   -    -     39   8  4.9     29  10  2.9
        Bomiljø                             -   -    -     20  38  0.5     21  16  1.3     14  23  0.6
        Bolig og bosted                     -   -    -      -   -    -     13   9  1.4      -   -    -
        Frivillig hjelp i organisasjoner    2   5  0.4      -   -    -      9   2  4.5      -   -    -
        Nyheter                             7  10  0.7      4   9  0.4      8   7  1.1      2   -
        Reise                               -   -    -      -   -    -      8  14  0.6      -   -    -
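The “freq” figures are simply occurrences of skal divided by the number of texts on that topic. As a minimal sketch, the snippet below recomputes them for the three topics where NL can be compared with at least one other L1, using the counts from the table. The dictionary layout and the helper name `per_text` are illustrative assumptions; the SP figure for Nyheter is left out because its text count is missing on the slide.

```python
# (occurrences of skal, number of texts) per topic and L1, copied from the table.
counts = {
    "Framtida": {"NL": (39, 8), "SP": (29, 10)},
    "Bomiljø": {"EN": (20, 38), "NL": (21, 16), "SP": (14, 23)},
    "Nyheter": {"DE": (7, 10), "EN": (4, 9), "NL": (8, 7)},
}


def per_text(occurrences, texts):
    """Occurrences per text, rounded to one decimal as on the slide."""
    return round(occurrences / texts, 1)


rates = {
    topic: {l1: per_text(*pair) for l1, pair in by_l1.items()}
    for topic, by_l1 in counts.items()
}

# Does NL have the highest rate in every topic where a comparison is possible?
nl_highest = all(max(r, key=r.get) == "NL" for r in rates.values())

for topic, r in rates.items():
    print(topic, r)
print("NL highest in every topic:", nl_highest)
```

This only reproduces the descriptive ratios. With so few texts per cell (for example 8 NL texts on Framtida), a significance test on the per-text counts would be needed before treating the differences as more than suggestive.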
    9. 9. Cross-linguistic explanation • In NL the future tenses are formed with the auxiliary zullen hij zal het diner vanavond maken • NL zullen cognate with skulle – finite form zal similar in form to skal – EN shall also cognate with skal and similar in form, but much less frequent in EN than ’ll, will and going to – DE werden is neither cognate nor similar in form • Conclusion: Strong tendency for NL speakers to overuse skal appears to be a case of formal lexical transfer – Caveat: NL has other means to express future action, including the non-past tense (hij maakt het diner vanavond) and the auxiliary gaan – Further investigation of relative frequencies necessary in order to confirm or disconfirm possible transfer effects ➔ Is there anything else that should be considered???
    10. 10. #2 DE speakers overuse en • Speakers of Slavic languages use the indefinite articles en (m.) and et (n.) much less frequently than learners from other L1 backgrounds – Also applies to SO, SQ and VI. As expected • But why do DE speakers use the masculine form en more than everyone else? – DE forms ein (m., n.), eine (f.) bear strong formal resemblance to en – Tendency to use en instead of et because of this? – Detailed error analysis required. • Hypothesis – That DE speakers commit errors of type <sic type="W" corr="et"><word>en</word></sic> more frequently than other L1 groups ➔ Comments??? PL RU EN NO NL DE Y Y Y X X X Y Y X X PL SH RU EN NO DE Y Y X X Y Y X X X PL RU SO EN NO DE Y Y X X X Y Y Y X X X PL RU SP EN NO DE Y Y X X X Y Y Y X X PL RU SQ EN NO DE Y Y X X X Y Y Y X X PL VI RU EN NO DE Y Y X X Y Y Y X X X
    11. 11. #3 EN speakers overuse et • Cross-linguistic explanation? – Avoidance of en (as indefinite article) due to identification with the numeral ‘one’? – Greater similarity between EN ‘a’ [ə] and NO et (short vowel, unvoiced dental plosive) than between ‘a’ and NO en (formal lexical transfer)? • Greater similarity between en and EN ‘an’, but ‘an’ much less frequent than ‘a’ – Wiktionary rankings #102 and #5 respectively – ‘a’ occurs 11 times more often than ‘an’ – Evidence that frequency constrains transfer? • Conclusion: L1 transfer appears to be at work when EN speakers overuse et ➔ But how can this be proved beyond doubt??? RU PL DE NL NO EN X X Y Y Y X X X RU PL SH DE NO EN Y Y X X X X SO RU PL DE NO EN X X Y Y X X X X RU PL DE SP NO EN X X Y Y Y X X X X RU PL SQ DE NO EN X X Y Y X X X X RU PL DE VI NO EN X X Y Y Y X X X X
    12. 12. #4 PL and RU speakers: den and det • These are 3SG pronouns, demonstratives, and (preposed) definite articles • RU speakers use den (m.) significantly more often than all other L1 groups, including PL speakers • PL speakers use det (n.) significantly more often than RU speakers – Absolute usage figures: • den PL 122, RU 166 (~40:60) • det PL 668, RU 496 (~60:40) ➔ Why??? ➔ How can we find out??? NOTE: • 3SG personal pronouns are identical in PL (on, ona, ono) and RU (он, она, оно) • Demonstrative pronouns – PL ten, ta, to – RU этот, эта, это $den SH EN PL DE NO RU X Y Y Y Y X X X $det NO RU SH EN DE PL X X X X Y Y Y Y X X X X
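The approximate 40:60 and 60:40 splits follow directly from the absolute figures. A minimal sketch of the arithmetic is given below; the helper name `shares` is hypothetical, and the calculation assumes the PL and RU subcorpora are comparable in size, which the slide does not state explicitly.

```python
# Absolute usage figures for den and det, copied from the slide.
counts = {
    "den": {"PL": 122, "RU": 166},
    "det": {"PL": 668, "RU": 496},
}


def shares(pair):
    """Percentage share of each L1 in the combined PL + RU total."""
    total = sum(pair.values())
    return {l1: round(100 * n / total) for l1, n in pair.items()}


for word, pair in counts.items():
    print(word, shares(pair))
```

The exact shares are slightly less extreme than the rounded ratios on the slide, but they point in opposite directions for the two words, which is the contrast the slide asks about.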
    13. 13. #5 EN speakers overuse er • EN speakers use er ‘is, are’ statistically more than all other L1 groups (except PL and SH) • Most likely explanation: formal transfer – formal resemblance er [æɾ] ~ are [ɑ(ɹ)] EN NO sg pl sg pl 1. am are er er 2. are are er er 3. is are er er • High salience of ‘to be’ in English (not least because of present continuous) – And yet, ENPC shows finite forms of NO være to be more frequent than finite forms of EN be • 8,182 vs. 6,566 occurrences ➔ So how to explain EN overuse??? RU NO NL DE PL EN X X Y Y Y X X X X RU NO DE PL SH EN X X X Y Y Y X X X SO RU NO DE PL EN X X Y Y X X X X RU NO DE SP PL EN Y Y X X X Y Y Y X X X RU NO SQ DE PL EN X X Y Y Y X X X X RU VI NO DE PL EN X X Y Y X X X X
    14. 14. #6 While RU speakers underuse er • PL and SH speakers use er more than RU speakers – Despite the fact that they are all Slavic languages • PL and SH have a copula in the present tense (być and бити ~ biti) PL dom jest tam SH куђа је тамо ~ kuća je tamo ‘the house is there’ • RU no longer has such a copula RU дом _ там ‘the house is there’ ➔ Case proved??? RU NO NL DE PL EN X X Y Y Y X X X X RU NO DE PL SH EN X X X Y Y Y X X X SO RU NO DE PL EN X X Y Y X X X X RU NO DE SP PL EN Y Y X X X Y Y Y X X X RU NO SQ DE PL EN X X Y Y Y X X X X RU VI NO DE PL EN X X Y Y X X X X
    15. 15. #7 Many L1 groups underuse være Underuse by RU, SH, SO, SQ and VI Possible cross-linguistic explanations: RU no copula in present tense VI copula là not used with adjectives (because adjectives are verbal), thus: Mai là sinh viên ‘Mai is (a) student’ but Mai cao ‘Mai is tall’ SH copula exists but little used due to contact with other Balkan languages SO yahay ‘to be’ contracts with adjectives, losing its root (-ah-) in the process SQ no infinitives (është is finite form) ➔ Case proved??? RU NL DE PL NO EN Y Y Y Y Y X X X X X SH RU DE PL NO EN Y Y Y Y X X X X X SO RU DE PL NO EN X X X X Y Y Y Y X X X X RU DE PL NO SP EN Y Y Y Y Y X X X X SQ RU DE PL NO EN Y Y Y Y X X X X X VI RU DE PL NO EN Y Y Y Y X X X X X
    16. 16. #8 But EN speakers overuse være • Overuse by EN speakers – Difference is statistical w.r.t. RU, SH, SO, SQ and VI • Difference w.r.t. NO not statistical, but still noticeable – In the English-Norwegian Parallel Corpus, be occurs much more frequently in English texts (both fiction and non-fiction) than være does in Norwegian texts • be: 3,126 occurrences • være: 1,193 occurrences – Worthy of a more detailed investigation using ENPC ➔ Alternative explanations? RU NL DE PL NO EN Y Y Y Y Y X X X X X SH RU DE PL NO EN Y Y Y Y X X X X X SO RU DE PL NO EN X X X X Y Y Y Y X X X X RU DE PL NO SP EN Y Y Y Y Y X X X X SQ RU DE PL NO EN Y Y Y Y X X X X X VI RU DE PL NO EN Y Y Y Y X X X X X
    17. 17. #9 Prepositions i and på • Preposition på ‘on’ – EN (overuse) vs. DE (underuse) – Investigate using error analysis – Check type and token frequencies of constructions in which corresponding L1 forms (on and auf) are congruent in one L1 but not the other, e.g.: – NO på søndag ≡EN on Sunday but≠DE am Sonntag whereas – NO på engelsk ≡DE auf Englisch but≠EN in English • Preposition i ‘in’ – RU (overuse) vs. PL (underuse) – Investigate using error analysis ➔ Any suggestions??? $i PL EN DE NO NL RU X X X Y Y Y Y X X X X PL EN DE SH NO RU Y Y X X X X X PL EN DE SO NO RU Y Y Y X X X X X PL EN SP DE NO RU Y Y X X X X X PL EN DE NO SQ RU X X X Y Y Y X X X X PL EN DE NO VI RU X X X Y Y Y Y X X X X $på DE RU NO NL PL EN Y Y Y Y Y X X X X X DE RU NO SH PL EN Y Y Y Y Y X X X X X SO DE RU NO PL EN X X X X Y Y Y Y X X X X DE RU NO SP PL EN Y Y Y Y Y X X X X X DE SQ RU NO PL EN Y Y Y Y Y X X X X X DE RU NO VI PL EN Y Y Y Y Y X X X X X Prepositions, especially spatial prepositions, are renowned for being “among the hardest expressions to acquire when learning a second language” (Coventry & Garrod 2004: 4) and they have already been the subject of some interesting work based on ASK (Szymanska 2010; Malcher 2011).
    18. 18. #10 Prepositions til and fra • Preposition til ‘to’ – underused by all L1 groups, especially DE, SH and SQ – … • Preposition fra ‘from’ – used statistically more often by EN speakers than by PL or native speakers – … ➔ Any suggestions here??? $til DE RU PL NL EN NO Y Y Y Y Y X X X X X SH DE RU PL EN NO Y Y Y Y X X X X X DE RU SO PL EN NO Y Y Y Y Y X X X X X DE RU SP PL EN NO Y Y Y Y Y X X X X X SQ DE RU PL EN NO Y Y Y Y X X X X X DE RU PL VI EN NO Y Y Y Y Y X X X X X $fra NO PL DE NL RU EN X X X X Y Y Y Y X X X NO PL SH DE RU EN X X Y Y Y X X X X NO PL DE SO RU EN X X X X Y Y Y Y X X X NO PL DE SP RU EN X X X X Y Y Y Y X X X NO PL DE SQ RU EN X X X X Y Y Y Y X X X NO PL DE VI RU EN X X X X Y Y Y Y X X X X
    19. 19. #11 Underuse and overuse of og • Striking contrast between PL speakers (underuse) and RU speakers (overuse) – Cannot be formal transfer, since PL i and RU и are phonologically identical • Different token frequencies in L1s? – Wiktionary frequency lists (WFREQ)* • RU и ranked as #1 • PL i ranked as #2 (after w ‘in’) – Raw frequencies not comparable in WFREQ • Zipfian distribution? • Requires further investigation ➔ Your suggestions??? PL DE NL EN NO RU Y Y X X X X PL DE SH EN NO RU X X Y Y X X X X PL DE EN SO NO RU X X Y Y Y X X X X PL SP DE EN NO RU X X Y Y X X X X PL SQ DE EN NO RU X X Y Y X X X X VI PL DE EN NO RU X X Y Y X X X X *
    20. 20. #12 Overuse and underuse of eller • DE and EN speakers overuse eller ‘or’ – Difference w.r.t. NL is highly statistical • This seems odd. (Are the Dutch more decisive than the English and Germans?) – Difference between DE and NO also statistical – Frequency related? • Mutual correspondence between NO eller and EN ‘or’ is 84% • RU speakers underuse eller – Strong formal resemblance with или (ili) • Possible cross-linguistic explanation – или has a more restricted distribution – Not used in negative contexts он не любит ни футбол, ни теннис ‘he doesn’t like football or tennis’ RU NO NL PL EN DE X X Y Y Y Y X X X X RU SH NO PL EN DE X X Y Y Y X X X X RU SO NO PL EN DE X X Y Y Y X X X X RU NO PL SP EN DE X X Y Y Y Y X X X RU SQ NO PL EN DE X X Y Y Y X X X X RU VI NO PL EN DE X X Y Y Y X X X X
    21. 21. #13 More general questions • Misclassification can also be revealing – Texts written by EN learners are more often misclassified as SP, rather than NL or DE, despite EN being more closely related to the latter ➔ Why??? – Texts by SO and SQ learners are most often misclassified as RU, whilst texts by VI learners are most often misclassified as PL ➔ Again, why??? • All 12 patterns discussed above pertain to the Indo-European languages most closely related to NO (DE, EN, NL; PL, RU) – There are no really clear-cut predictors for the most distantly related L1s, i.e. SO, SQ and VI ➔ Why???
    22. 22. Conclusion • Discriminant analysis reveals subtle patterns of L2 usage that would otherwise go undetected • Homogeneity tables based on Tukey’s HSD can help us understand those patterns • Contrastive analysis is required in order to confirm that the patterns are due to cross-linguistic influence • All 13 issues discussed in this chapter are suitable topics for further research using ASK • This study has merely scratched the surface…
    23. 23. 13 research questions 1. Why do NL speakers overuse skal? 2. Why do DE speakers overuse en? 3. Why do EN speakers overuse et? 4. Why do PL and RU speakers differ so much in their use of den and det? 5. Why do EN speakers overuse er? 6. Why do RU speakers underuse er? 7. Why do many L1 groups underuse være? 8. Why do EN speakers, on the other hand, overuse være? 9. Why do EN speakers overuse på, while DE speakers underuse it? And why do RU speakers overuse i, while PL speakers underuse it? 10. Why do all L1 groups underuse til – and why do EN speakers overuse fra? 11. Why do PL and RU speakers differ so markedly in their use of og? 12. Why do EN and DE speakers overuse eller and why do RU speakers underuse it? 13. What lies behind the misclassification patterns, and why are there no good predictors for SO, SQ and VI?
    24. 24. References
        Donaldson, Bruce. 1997. Dutch: A Comprehensive Grammar. London: Routledge.
        Granger, Sylviane. 1996. From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora. In Karin Aijmer, Bengt Altenberg and Mats Johansson (eds.) Languages in Contrast. Papers from a symposium on text-based cross-linguistic studies. Lund 4–5 March 1994. Lund: Lund University Press [Lund Studies in English 88], 37–51.
        Greftegreff, Liv Astrid. 1985. Enkel norsk grammatikk. Oslo: NKS-Forlaget.
        Husby, Olaf. 1999. En kort innføring i albansk. Trondheim: Tapir.
        Husby, Olaf. 2001. En kort innføring i somali. Trondheim: Tapir.
        Jarvis, Scott. 2010. Comparison-based and detection-based approaches to transfer research. EUROSLA Yearbook 10, 169–192.
        Jarvis, Scott & Scott A. Crossley (eds.) 2012. Approaching Language Transfer through Text Classification. Explorations in the detection-based approach. Bristol: Multilingual Matters.
        Koolhoven, H. 1961. Teach yourself Dutch. London: The English Universities Press.
        Lie, Svein. 2005. Kontrastiv grammatikk – med norsk i sentrum, 3rd Edition. Oslo: Novus.
        Malcher, Jenny. 2011. Jeg liker å treffe folk i café. Man må nyter de fine tingene på verden! Preposisjoner og morsmålstransfer – en korpusbasert studie med i og på i fokus. Master’s thesis, Department of Linguistics and Scandinavian Studies, University of Oslo.
        Mønnesland, Svein. 1990. Serbokroatisk-norsk kontrastiv grammatikk. In Hvenekilde, Anne (ed.) Med to språk: Fem kontrastive språkstudier for lærere. Oslo: Cappelen.
        Saeed, John Ibrahim. 1993. Somali Reference Grammar, 2nd Edition. Kensington, MD: Dunwoody Press.
        Szymanska, Oliwia. 2010b. A conceptual approach towards the use of prepositional phrases in Norwegian – the case of i and på. Folia Scandinavica 11, 173–183.
        Wade, Terence. 2011. A Comprehensive Russian Grammar. Malden, MA: Wiley.
        Wiull, Hans Olaf. 2007. Bli bedre i norsk – se forskjellene mellom norsk og vietnamesisk. Oslo: VOX.