CORPUS LINGUISTICS
Corpus linguistics as Applied Linguistics
& Teaching / Language corpora
(based on P. Rayson’s & D. Archer’s course for H & S linguistics
Lancaster University & U. of Central Lancashire)
A corpus can be defined as a collection of texts
assumed to be representative of a given language
put together so that it can be used for linguistic
analysis. Usually the assumption is that the
language stored in a corpus is naturally-occurring,
that it is gathered according to explicit design
criteria, with a specific purpose in mind, and with
a claim to represent natural chunks of language
selected according to specific typology.
Tognini-Bonelli (2001:2)
“nowadays the term 'corpus' nearly
always implies the additional
feature of 'machine-readable'”.
McEnery & Wilson, Corpus Linguistics. Online manual.
Types of (electronic)
corpora
Based on: http://www.uow.edu.au/~dlee/CBLLinks.htm
ENGLISH CORPORA: GENERAL LANGUAGE
CORPORA
The 1980s and early 1990s:
-Bank of English
-monitor corpus
-both spoken and written text
-different regional varieties of English
-British National Corpus (BNC)
-90 million written words
-10 million spoken words
-freely accessible (search interface and KWIC)
The 1990s and 2000s
-speech corpora:
-sound recordings (started in previous corpora)
-SPOKEN ENGLISH CORPUS (e.g., MICASE)
-detailed description of spoken phenomena: phonology,
prosody (stress, tone units…), etc
-multimedia corpora:
-transcripts synchronised with audio/video recordings
-TALKBANK Website: SANTA BARBARA CORPUS
OF SPOKEN AMERICAN ENGLISH (SBCSAE)
The 1990s and 2000s
-parsed corpora:
-syntactically analysed
-SURFACE AND UNDERLYING STRUCTURAL
ANALYSES AND NATURALISTIC ENGLISH CORPUS
(SUSANNE)
-historical / diachronic corpora:
-English of different periods (e.g., Helsinki)
-may cover specific historical periods or genres (e.g.,
COCA)
-track and describe how language has evolved
-A REPRESENTATIVE CORPUS OF HISTORICAL
ENGLISH REGISTERS (ARCHER)
The 1990s and 2000s
-specialised corpora:
-focus on concrete genres/domains
-BUSINESS LETTERS CORPUS (BLC) (e.g., Someya)
-lingua franca corpora:
-ENGLISH AS A LINGUA FRANCA IN ACADEMIC
SETTINGS (ELFA) CORPUS
-intercultural exchanges among speakers who use
English as a lingua franca (also, e.g., ICLE)
The 1990s and 2000s
-developmental language corpora:
-non-adult English native speakers' output (e.g.,
CHILDES)
-output less proficient than that found in adult native-speaker corpora
-POLYTECHNIC OF WALES (POW) CORPUS
-ESL/EFL learner corpora:
-output of learners of English
-one and the same L1 background or different mother
tongues
-JAPANESE EFL LEARNER CORPUS (JEFLL)
The 2010s
On-line corpora + on-line management tools
(e.g., Sketch Engine)
Corpora builders on-line
N-gram and conc-gram builders
Literature and corpora
Other uses: Socio-cultural studies,
historical, Translation, Law, (?)
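The idea behind an n-gram builder can be sketched in a few lines. This is a minimal illustration only, not how any particular on-line tool works; the function name `ngrams` and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, split on whitespace for simplicity.
tokens = "on the other hand it is on the other side".split()

# Count the three-word sequences and show the most frequent ones.
counts = Counter(ngrams(tokens, 3))
print(counts.most_common(2))
```

Recurrent n-grams found this way are a first, rough indicator of fixed phrases in a corpus.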
WORDSMITH: how to manage corpora
-Computer program which permits users to compile their
own corpus
-Texts must be in .txt (doc, htm) format
-Any text can be subjected to the same process of
analysis that official corpora undergo: concordance
lines, word lists, etc
-No need to pre-process such texts in advance (e.g., XML
coding or exhaustive tagging)
Corpus linguistics: Why applied linguistics?
-Insights into the internal workings of real language
-Knowledge in turn also used in other fields of enquiry
-Planning, designing, compiling and tagging
-Frequency lists and concordance lines (+further analysis)
-Sinclair’s (2003) “degeneralisation”:
-sceptical about 'received' descriptions
-patterns found in the data: more precise or alternative descriptions
-Corpus-based dictionaries and grammars
-how lexis and grammar are “really” used
-COLLINS COBUILD LEARNER'S DICTIONARY
-THE LONGMAN GRAMMAR OF SPOKEN AND WRITTEN
ENGLISH
Underlying assumption
Intuition is not enough to study language
…
Reaction to Noam Chomsky’s focus
on introspection in 1950s/60s
Empirical observation of naturally
occurring data versus theory of how
human language processing is actually undertaken
REPRESENTATIVENESS is Key:
i.e. a balanced sample of a language or a
particular variety of language; cf.
national corpora (British, American,
Czech, Polish …)
Reasoning?
Helps to remove intuitive bias
Helps us to find common/ rare phenomena
AND SIZE IS ALSO IMPORTANT
Corpus                            Date          Size (words)
Brown/LOB                         1960s         1 million
Birmingham corpus                 1980          10 million
BNC                               1990s         100 million
Collins Bank of English,          2006          600 million – 1 billion
Cambridge International Corpus,
Oxford English Corpus
Web                               Present day   ? billion
Web                               Future        ? billions
So what is corpus linguistics?
= the “study of language using corpora”
= empirical methodology
= a useful means of exploring:
Synchronic and diachronic variation
Syntax, semantics, pragmatics
Lexicography
Dialects, minority languages
Not just English
Corpus techniques we utilise
Retrieval
Frequency profiling
Concordancing
Collocations
Key words
Key domains
Annotation
POS tagging
Semantic tagging
Annotation
-Part-of-speech tagging
-Semantic field tagging
Retrieval
-Frequency lists
-Concordances
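Frequency lists and concordances can be illustrated with a short sketch. It assumes a very crude tokeniser (lowercased alphabetic strings) and an invented sample sentence; the function names `tokenize`, `frequency_list` and `kwic` are hypothetical and do not belong to any corpus package.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top)

def kwic(tokens, node, width=3):
    """Return key-word-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical sample text.
sample = "The news is good. No news is good news, and the news spreads fast."
tokens = tokenize(sample)

print(frequency_list(tokens, top=3))
for line in kwic(tokens, "news"):
    print(line)
```

Real concordancers add sorting by left or right context and handle punctuation and tagging, which this sketch ignores.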
Key words
[Diagram: a text is compared with another text or a reference corpus to produce its keywords]
What are “key words”? And why are they so useful?
Key words
Word Clouds
If we compare text A with text B, we can discover the most
significant items within text A, and not only the frequent items.
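The comparison of text A with text B can be sketched as follows. The slides do not say which statistic is used; this example assumes log-likelihood, one commonly used keyness measure, and the function names and the two toy texts are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c tokens (text A) vs b times in d tokens (text B)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in text A
    e2 = d * (a + b) / (c + d)  # expected frequency in text B
    ll = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def keywords(text_tokens, ref_tokens, top=5):
    """Rank words that are relatively more frequent in text A than in text B."""
    text_freq, ref_freq = Counter(text_tokens), Counter(ref_tokens)
    c, d = len(text_tokens), len(ref_tokens)
    scored = []
    for word, a in text_freq.items():
        b = ref_freq[word]
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Hypothetical toy texts; real comparisons need far larger samples.
text_a = "the corpus shows the pattern and the corpus confirms the pattern".split()
text_b = "the cat sat on the mat and the dog sat on the rug".split()
print(keywords(text_a, text_b, top=3))
```

Note that "the" is the most frequent word in text A but scores low, because it is about equally frequent in text B: keyness picks out what is distinctive, not merely what is frequent.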
Collocations
Collocation = a relationship between words that tend to
occur together in texts
Words that tend to occur near word X are the collocates
of word X (consider “fish and XXXXX”)
Based on frequency (how frequent separate vs. how
frequent together = T-SCORE & mutual information)
The company a word keeps: implicit associations or
assumptions (also “semantic prosody”, cf. Sinclair, 2002)
e.g., Bachelor: eligible, flat, life, days
Spinster: elderly, widows, sisters, parish
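The two measures named above can be sketched for the simplest case. This example assumes that "together" means the collocate immediately follows the node word (real tools usually use a wider window and adjust the expected frequency accordingly); the function name and the sample sentence are hypothetical.

```python
import math
from collections import Counter

def collocation_scores(tokens, node, collocate):
    """Return (observed, MI, t-score) for `collocate` directly following `node`."""
    n = len(tokens)
    freq = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(node, collocate)]
    # Frequency expected if the two words were independent of each other.
    expected = freq[node] * freq[collocate] / n
    if observed == 0:
        return 0, None, None
    mi = math.log2(observed / expected)
    t_score = (observed - expected) / math.sqrt(observed)
    return observed, mi, t_score

# Hypothetical sample text.
tokens = ("we had fish and chips then more fish and chips "
          "and some fish and peas").split()
print(collocation_scores(tokens, "and", "chips"))
print(collocation_scores(tokens, "and", "peas"))
```

Mutual information tends to favour rare but exclusive pairings, while the t-score favours frequent ones, which is why the two are often reported side by side.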
Corpus software
Our scenario
SEM TAGGER
POS TAGGER
Book Search
Other texts – not compiled for
corpus linguistics
[Diagram labels]
-Linguistic analysis
-Natural language processing & computational linguistics
-Corpus
-Empirical evidence to inform linguist…
-Statistical and rule-based language models
-Corpus linguistics
-Theories on language / communication
-Feedback
Corpus linguistics for the analysis of real linguistic /
paralinguistic / extra-linguistic information
[Screenshot labels: audiovisual element; some mark-up for context; space for our own annotation]
Task for you:
What corpus steps / resources would you have to
go through in order to…?
-Find the most frequent word in spoken English
-Identify different collocates of “news” in
written English
-Determine the most frequent verb tense used in
business letters
-Determine the key errors that your students make
by comparison with other students’ errors
CORPORA and LANGUAGE TEACHING /
LEARNING
-Mixture of instructional and naturalistic language learning (LL)
-Fulfilment of both the input and output hypotheses
-“Scaffolding” (though loosely speaking)
-insights concerning English culture(s)
-Student-centred and related to constructivism:
mastering corpora = learning autonomy
CORPUS-BASED ESL/EFL ACTIVITIES
-Focus on lexis, grammar and register
-introductory notions concerning collocation,
colligation, and formal vs. informal
-For already motivated students: corpora on-line
Activity one:
contractions, formal or
informal?
spoken or written?
[Screenshot of corpus search, steps 1–4; query shown: * ?’??]
Quotation marks!
Activity two:
Corpora as a source of
knowledge concerning
collocation and colligation
[v*] mistakes
[aj*]
powerful, not strong!
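A query such as [v*] mistakes (any verb directly before "mistakes") can be imitated on a small tagged sample. This is only a sketch: the hand-tagged data, the simplified tags ("v", "aj", "n"…) and the function name `tag_then_word` are all assumptions made for the example, not the search syntax or tagset of any real corpus interface.

```python
from collections import Counter

# Hypothetical hand-tagged sample: (word, tag) pairs with simplified tags.
tagged = [
    ("learners", "n"), ("make", "v"), ("mistakes", "n"), ("and", "c"),
    ("teachers", "n"), ("correct", "v"), ("mistakes", "n"), ("but", "c"),
    ("everyone", "p"), ("makes", "v"), ("mistakes", "n"),
    ("serious", "aj"), ("mistakes", "n"),
]

def tag_then_word(tagged_tokens, tag_prefix, word):
    """Count words whose tag starts with `tag_prefix` directly before `word`."""
    hits = Counter()
    for (w1, t1), (w2, _) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith(tag_prefix) and w2 == word:
            hits[w1] += 1
    return hits

print(tag_then_word(tagged, "v", "mistakes"))   # verbs before "mistakes"
print(tag_then_word(tagged, "aj", "mistakes"))  # adjectives before "mistakes"
```

Students can use the resulting counts to check which verbs and adjectives actually combine with a noun, instead of relying on intuition.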
Activity three:
meaning via collocations and
co-text
Activity four: Lexical descriptions
-Contact with the English language: input (at least lexis-wise)
-Two corpora: author corpus and reference corpus
-Select the text you want a list of
-Save both lists (author corpus list and reference corpus list) to
compare them with Keyword
Some concluding remarks = Task for you:
What corpus steps would you have to take in
order to:
-Demonstrate that the use of some verbs is
incorrect
-Investigate real lexical levels needed for
different learning stages (e.g., beginner versus
intermediate)
-Determine the main grammatical structures in
conversation that learners may need
Adolphs, Svenja. (2006). Introducing electronic text analysis: A practical
guide for language and literary studies. London: Routledge.
Biber, Douglas, Conrad, Susan, & Reppen, Randi. (1998). Corpus
Linguistics: Investigating language structure and use. Cambridge: CUP.
Hockey, Susan. (2000). Electronic texts in the humanities: Principles and
practice. Oxford: Oxford University Press.
Hunston, Susan. (2002). Corpora in applied linguistics. Cambridge:
Cambridge University Press.
Kennedy, Graeme D. (1998). An introduction to corpus linguistics.
London: Longman.
McEnery, Tony, & Wilson, Andrew. (2001). Corpus linguistics (2nd
ed.). Edinburgh: Edinburgh University Press.
Meyer, Charles. (2002). English corpus linguistics: An introduction.
Cambridge: Cambridge University Press.
Ooi, Vincent B.Y. (1998). Computer Corpus Lexicography. Cambridge
Sampson, Geoffrey & McCarthy, Diana (Eds.). (2004). Corpus
linguistics. London: Continuum.
Scott, Mike, & Tribble, Chris. (2006). Textual patterns: Keyword and
corpus analysis in language education. Amsterdam: Benjamins.
Sinclair, John. (1991). Corpus, Concordance, Collocation. Oxford
UCREL website (many others … 1991 – 1999 / 2000 – 2005 / (…) ?
