Corpus Linguistics for
Language Learning and Teaching
Martin Wynne martin.wynne@bodleian.ox.ac.uk
Bodleian Libraries
Faculty of Linguistics, Philology and Phonetics
The 'aftermath' of the seminar
Subject: Les Francais des Corpus – Aftermath
Dear colleagues,
First, many thanks for presenting at/attending the Francais des Corpus Workshop and for making it such a success.
I promised I would keep you in touch with one another and hope that the full list of your e-mail addresses above makes that possible.
…
'aftermath'
Collocates:
War
Gulf
coup
World
disaster
Tiananmen
death
revolution
defeat
Chernobyl
affair
riots
battle
massacre
wars
election
Crisis
events
explosion
invasion
trial
fire
June
Square
victory
accident
attempt
Significant collocates in the British National Corpus
(a representative corpus of British English released in 1994).
BNCWeb parameters:
There are 1486 different types in your collocation database
for the query "[word="aftermath"%c] [word="of"%c]".
(Your query "aftermath of" returned 544 hits in 337 different texts)
The selected range was 1 to 4.
Corpus basis for calculation: the whole BNC.
Type of calculation: Log-likelihood
Tag restriction: any noun
Collocates occur at least 5 times in the whole BNC.
Words collocate at least 5 times.
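The slide names log-likelihood as the association measure but does not show how it is computed. As a rough sketch, the statistic commonly used for collocation is the G2 score on a 2x2 contingency table; whether BNCweb computes it in exactly this way is an assumption here. The function names and all counts below are invented for illustration and are not BNC figures.

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """G2 statistic for a 2x2 table of observed counts.

    o11: collocate inside the window around the node
    o12: other words inside the window
    o21: collocate elsewhere in the corpus
    o22: other words elsewhere in the corpus
    """
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n)
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

def rank_collocates(window_counts, corpus_counts, window_size,
                    corpus_size, min_freq=5):
    """Score each candidate and sort by G2, highest first.

    min_freq mirrors the slide's thresholds: the word must occur at
    least 5 times in the corpus and at least 5 times in the window.
    """
    scores = {}
    for word, in_window in window_counts.items():
        total = corpus_counts[word]
        if in_window < min_freq or total < min_freq:
            continue
        scores[word] = log_likelihood(
            in_window,
            window_size - in_window,
            total - in_window,
            corpus_size - window_size - (total - in_window),
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy counts, for illustration only.
window_counts = {"war": 40, "coup": 12, "time": 45, "zebra": 1}
corpus_counts = {"war": 3000, "coup": 150, "time": 20000, "zebra": 20}
ranking = rank_collocates(window_counts, corpus_counts,
                          window_size=2000, corpus_size=1_000_000)
for word, score in ranking:
    print(f"{word:6s} {score:8.2f}")
```

Note that G2 measures how far the observed counts depart from chance in either direction, so a word that is unusually rare near the node can also score highly; checking observed against expected frequency guards against misreading such cases.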
What is a corpus?
“…a collection of pieces of language, selected and ordered according
to explicit linguistic criteria in order to be used as a sample of the
language.”
(Sinclair 1996)
What is Corpus Linguistics?
(1) Focus on linguistic performance, rather than competence
(2) Focus on linguistic description, rather than linguistic
universals
(3) Focus on quantitative, as well as qualitative models of
language
(4) Focus on a more empiricist, rather than rationalist view of
scientific inquiry.
(Leech 1992)
How do you know things about
language? Where do we get our
knowledge from?
What does your knowledge and
experience tell you about the use of
‘try to’ & ‘try and’?
Fill in the blanks
1. Did you try … talk her out of swimming?
2. Mr. Kissinger, try … explain to us what might happen
3. He did it to try … score points
4. They both wanted to try … have a family
5. They try … treat you like machines
6. Sometimes, people try … make fun of you by
imitating you.
7. Now the government will try … sell all of this.
8. Did you try … get out of it?
9. I will try … understand this.
Fill in the blanks
1. Did you try and talk her out of swimming?
2. Mr. Kissinger, try and explain to us what might happen
3. He did it to try and score points
4. They both wanted to try and have a family
5. They try to treat you like machines
6. Sometimes, people try to make fun of you by imitating you.
7. Now the government will try to sell all of this.
8. Did you try to get out of it?
9. I will try ? understand this. [This one was made up!]
• “Try and do something is incorrect for try to do…” [Partridge and Greet 1947]
• “Try and is well established in conversational use .. Try to is to be preferred in serious writing” [Plain Words 1986]
• “… try and has been socially acceptable for these two centuries … is not used in an elevated style” [Webster’s Dictionary 1989]
What are the factors governing the choice
and distribution of try to vs. try and ?
How would you investigate this question?
[Chart: percentage share (0% to 100%) of try to vs. try and in Spoken British English, Written British English, Spoken American English and Written American English]
Based on CobuildDirect and Longman Spoken American Corpus.
[Chart: frequency per million words (pmw, axis 0 to 350) of try and vs. try to in BNC, COCA, Hansard, GloWbE, COHA, soap operas and Wikipedia]
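Because these corpora differ greatly in size, raw counts cannot be compared directly, which is why the figures are given per million words (pmw). A minimal sketch of that normalisation follows, assuming a crude regular-expression tokenizer and counting only the base form "try" (not "tries" or "tried"); the function names and the sample text are invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def try_variants_pmw(text):
    """Return the frequency per million words of 'try to' and 'try and'."""
    tokens = tokenize(text)
    variants = ("try to", "try and")
    if not tokens:
        return {v: 0.0 for v in variants}
    # Count adjacent word pairs that start with 'try'.
    counts = Counter(
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first == "try" and second in ("to", "and")
    )
    scale = 1_000_000 / len(tokens)
    return {v: counts[v] * scale for v in variants}

# Invented sample text, far too small for real conclusions.
sample = "I will try to call. Try and stop me. They try to help."
result = try_variants_pmw(sample)
for variant, pmw in result.items():
    print(f"{variant}: {pmw:.0f} pmw")
```

On a real corpus the same function would be applied to each sub-corpus separately (for example spoken vs. written), so that the normalised figures can be set side by side.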
Try to or try and? Verb
complementation in British and
American English
Hommerberg & Tottie (2007)
ICAME Journal 31:45-64
http://icame.uib.no/ij31/ij31-page45-64.pdf
Uses of corpus linguistics in
language pedagogy
● Developing new theories (e.g. differences between regional varieties, identifying new varieties such as 'English as a Lingua Franca')
● A source of primary data for developing e.g.:
  ➢ dictionaries
  ➢ grammars
  ➢ textbooks (and other teaching materials)
● Preparing materials for classes (e.g. as a source of examples)
● Studying learner language (in a learner corpus)
● Data-driven learning in the classroom
“Why not just Google it?”
Linguistic
● Biased distribution of text-types and genres
● Repeated and reused text
● Unknown provenance (“Who wrote this, when, and why?”)
● Mixture of native and non-native producers of language
● Mixture of varieties
Technical
● Unclear separation of elements of the webpage (body text, sidebars, adverts, etc.)
● Accessing the ‘hidden web’ (content which is not visible to search)
● Accessing language embedded in audio and video streams
● Lack of persistent locations and identifiers
Methodological
● Difficult to compare frequencies of occurrence
● Unknown (or undesirable) sampling and ranking strategies of search engines (e.g. promoting commercial products and services, prioritizing words in titles and headers, user-specific settings)
Problems with language in the corpus
● Limited by copyright (and other legal and ethical barriers)
● Expensive, time-consuming and slow to make
● Limited size
● Not up to date
● Incomplete information about provenance and context
● Design decisions were made by someone else
● Not easily comparable to other corpora
● Access restrictions
● Limited functions available for analysis and exploration
● Not connected to other resources or tools
● Difficult to deploy in the classroom
Exercise
Find evidence in one or more corpora to help explain the sources of irony and humour in Homer’s utterance:
“I'm just going out to commit certain deeds”
http://kisscartoon.eu/watch/the-simpsons-season-9-episode-16-dumbbell-indemnity/ (9:27-9:52)
Links for Practical Work
● http://bncweb.lancs.ac.uk/ (register using ac.uk email address)
● http://corpus.byu.edu
● All links can be found via: https://ota.ox.ac.uk/oxonly/oxford.xml
Data-driven language learning in the
classroom: some reflections
● Can you use a corpus to reveal 'real language'?
● Do we want to teach ‘real language’? Should teachers prefer to control the rate and order of exposure to linguistic features?
● Can teachers easily deal with unrestricted language in the classroom?
● Effective reading and interpretation of concordance lines and collocation lists require practice, and the acquisition of skills.
● There are often difficult technical issues in effective deployment of corpora in the classroom.
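For readers who have not seen concordance lines before, the following sketch shows roughly what a key-word-in-context (KWIC) display looks like and how one might be produced from plain text. It is an illustration only, not how any particular concordancer works; the function name `kwic`, the `width` parameter and the sample text are invented.

```python
import re

def kwic(text, pattern, width=30):
    """Return key-word-in-context lines for every match of `pattern`.

    Each line shows `width` characters of left context, the match in
    square brackets, and `width` characters of right context, so that
    the matches line up in a single column.
    """
    flat = " ".join(text.split())  # collapse line breaks and extra spaces
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

# Invented sample text.
sample = ("In the aftermath of the war, prices rose. "
          "The aftermath of the coup was grim.")
lines = kwic(sample, r"\baftermath of\b", width=20)
for line in lines:
    print(line)
```

Aligning the search term in one column is what lets learners scan vertically for recurring patterns to the left and right of it, which is the reading skill the reflections above refer to.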
AntConc
● Download for free from http://www.antlab.sci.waseda.ac.jp/software.html
● Use with any 'plain' text (txt, html, xml)
● Multilingual capabilities
● Does not interpret mark-up or metadata
CQPweb: an online interface for many corpora
http://cqpweb.lancs.ac.uk
Finding resources
https://ota.ox.ac.uk/oxonly/oxford.xml
References
● Chambers, A. and M. Wynne (2008). Sharing corpus resources in language learning. In F. Zhang and B. Barber (eds.), Handbook of Research on Computer-Enhanced Language Acquisition and Learning (pp. 438-451). Hershey, PA: IGI Global.
● Hommerberg, C. and G. Tottie (2007). Try to or Try and? Verb complementation in British and American English. ICAME Journal 31: 45-64. http://icame.uib.no/ij31/ij31-page45-64.pdf
● Leech, G. (1992). Corpora and theories of linguistic performance. In J. Svartvik (ed.), Directions in Corpus Linguistics (pp. 105-122). Berlin: Mouton de Gruyter.
● McEnery, A. and Z. Xiao (2010). What corpora can offer in language teaching and learning. In E. Hinkel (ed.), Handbook of Research in Second Language Teaching and Learning (Vol. 2). London / New York: Routledge. [http://www.lancaster.ac.uk/fass/projects/corpus/ZJU/xpapers/McEnery_Xiao_teaching.PDF]
● McEnery, A., R. Xiao and Y. Tono (2006). Corpus-based Language Studies: An Advanced Resource Book. Routledge.
● Sinclair, J. McH. (1996). Preliminary recommendations on corpus typology. EAGLES Document TCWG-CTYP/P (available from http://www.ilc.cnr.it/EAGLES/corpustyp/corpustyp.html).
Online resources
● Corpora for users in the University of Oxford https://ota.ox.ac.uk/oxonly/oxford.xml
● Brigham Young Corpora http://corpus.byu.edu/ (also via Solo)
● British National Corpus http://bncweb.lancs.ac.uk/ (free registration required here)
● Linguee (bilingual translations) https://www.linguee.com/
● VOICE. 2013. The Vienna-Oxford International Corpus of English (version 2.0 Online) http://voice.univie.ac.at, also available for download from the Oxford Text Archive (http://purl.ox.ac.uk/ota/2542).
