Corpus linguistics intro

CORPUS LINGUISTICS
Corpus linguistics as Applied Linguistics
& Teaching / Language corpora
(based on P. Rayson’s & D. Archer’s course for H & S linguistics
Lancaster University & U. of Central Lancashire)

A corpus can be defined as a collection of texts
assumed to be representative of a given language
put together so that it can be used for linguistic
analysis. Usually the assumption is that the
language stored in a corpus is naturally-occurring,
that it is gathered according to explicit design
criteria, with a specific purpose in mind, and with
a claim to represent natural chunks of language
selected according to a specific typology.
Tognini-Bonelli (2001:2)

“nowadays the term 'corpus' nearly
always implies the additional
feature of 'machine-readable'”.
McEnery & Wilson, Corpus Linguistics. Online manual.

Types of (electronic)
corpora
Based on: http://www.uow.edu.au/~dlee/CBLLinks.htm

ENGLISH CORPORA: GENERAL LANGUAGE
CORPORA
The 1980s:
-Bank of English
-monitor corpus
-both spoken and written text
-different regional varieties of English
-British National Corpus (BNC)
-90 million written words
-10 million spoken words
-freely accessible (search interface and KWIC)

The 1990s and 2000s
-speech corpora:
-sound recordings (started in previous corpora)
-SPOKEN ENGLISH CORPUS (e.g., MICASE)
-detailed description of spoken phenomena: phonology,
prosody (stress, tone units…), etc
-multimedia corpora:
-transcripts synchronised with audio/video recordings
-TALKBANK Website: SANTA BARBARA CORPUS
OF SPOKEN AMERICAN ENGLISH (SBCSAE)

The 1990s and 2000s
-parsed corpora:
-syntactically analysed
-SURFACE AND UNDERLYING STRUCTURAL
ANALYSES AND NATURALISTIC ENGLISH CORPUS
(SUSANNE)
-historical / diachronic corpora:
-English of different periods (e.g., Helsinki)
-may cover specific historical periods or genres (e.g.,
COCA)
-track and describe how language has evolved
-A REPRESENTATIVE CORPUS OF HISTORICAL
ENGLISH REGISTERS (ARCHER)

The 1990s and 2000s
-specialised corpora:
-focus on concrete genres/domains
-BUSINESS LETTERS CORPUS (BLC) (e.g., Someya)
-lingua franca corpora:
-ENGLISH AS A LINGUA FRANCA IN ACADEMIC
SETTINGS (ELFA) CORPUS
-intercultural exchanges among speakers who use
English as a lingua franca (also, e.g., ICLE)

The 1990s and 2000s
-developmental language corpora:
-non-adult English native speakers' output (e.g.,
CHILDES)
-not as proficient as native-speaker corpora
-POLYTECHNIC OF WALES (POW) CORPUS
-ESL/EFL learner corpora:
-output of learners of English
-one and the same L1 background or different mother
tongues
-JAPANESE EFL LEARNER CORPUS (JEFLL)

The 2010s
On-line corpora + management tools on-line
(e.g., Sketch Engine)
Corpora builders on-line
N-gram and conc-gram builders
Literature and corpora
Other uses: Socio-cultural studies,
historical, Translation, Law, (?)
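
The n-gram builders mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenisation and a made-up sample sentence. The function names are hypothetical and do not reproduce any particular on-line tool.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    # Zipping n shifted copies of the token list yields the sliding window.
    return zip(*(tokens[i:] for i in range(n)))

def ngram_frequencies(tokens, n, top=None):
    """Count n-grams and return them from most to least frequent."""
    return Counter(ngrams(tokens, n)).most_common(top)

# Hypothetical sample text, tokenised on whitespace only.
tokens = "at the end of the day it is the end of the road".split()
for gram, count in ngram_frequencies(tokens, 3, top=2):
    print(" ".join(gram), count)
```

With a real corpus the same counts would normally be restricted to n-grams above a minimum frequency, since most n-grams occur only once.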

WORDSMITH: how to manage corpora
-Computer program which permits users to compile their
own corpus
-Texts must be in .txt (doc, htm) format
-Any text can be subjected to the same process of
analysis that official corpora undergo: concordance
lines, word lists, etc
-No need to pre-process such texts in advance (e.g., XML
coding or exhaustive tagging)

Corpus linguistics: Why applied linguistics?
-Insights into the internal workings of real language
-Knowledge in turn also used in other fields of enquiry
-Planning, designing, compiling and tagging
-Frequency lists and concordance lines (+further analysis)
-Sinclair’s (2003) “degeneralisation”:
-sceptical about 'received' descriptions
-patterns found in the data: more precise or alternative descriptions
-Corpus-based dictionaries and grammars
-how lexis and grammar are “really” used
-COLLINS COBUILD LEARNER'S DICTIONARY
-THE LONGMAN GRAMMAR OF SPOKEN AND WRITTEN
ENGLISH

Underlying assumption
Intuition is not enough to study language
…
Reaction to Noam Chomsky’s focus
on introspection in 1950s/60s
Empirical observation of naturally
occurring data versus theory of how
human language processing is actually undertaken

REPRESENTATIVENESS is Key:
i.e., a balanced sample of a language or a
particular variety of language (cf.
national corpora: British, American,
Czech, Polish …)
Reasoning?
Helps to remove intuitive bias
Helps us to find common/ rare phenomena

AND SIZE IS ALSO IMPORTANT
-Brown/LOB (1960s): 1 million words
-Birmingham corpus (1980): 10 million
-BNC (1990s): 100 million
-Collins Bank of English, Cambridge International Corpus,
Oxford English Corpus (2006): 600 million – 1 billion
-Web (present day): ? billion
-Web (future): ? billions

So what is corpus linguistics?
= the “study of language using corpora”
= empirical methodology
= a useful means of exploring:
Synchronic and diachronic variation
Syntax, semantics, pragmatics
Lexicography
Dialects, minority languages
Not just English

Corpus techniques we utilise
Retrieval
Frequency profiling
Concordancing
Collocations
Key words
Key domains
Annotation
POS tagging
Semantic tagging
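
Two of the retrieval techniques listed above, frequency profiling and concordancing, can be sketched in a few lines. This is only an illustration under simple assumptions: a naive regular-expression tokeniser, a made-up sample text, and hypothetical function names. It is not how WordSmith or any other package is implemented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens, top=None):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common(top)

def concordance(tokens, node, width=3):
    """Return KWIC lines with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical sample text standing in for a corpus file.
sample = "The news is good. No news is good news, but bad news travels fast."
tokens = tokenize(sample)
print(frequency_list(tokens, top=3))
for line in concordance(tokens, "news"):
    print(line)
```

Sorting the concordance lines by the word to the left or right of the node word is the usual next step when looking for patterns.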

Annotation: part-of-speech tagging, semantic field tagging
Retrieval: frequency lists, concordances

Key words
Text, compared against another text or a reference corpus: keywords
What are “key words”?
And why are they so useful?

Key words
Word Clouds
If we compare text A with text B, we can discover the most
significant items within text A, and not only
the frequent items
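
The comparison of text A with text B can be sketched as follows. The slides do not say which statistic is used, so the log-likelihood keyness measure here is an assumption (it is one commonly used option). The sample texts and function names are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a study text of c tokens
    and b times in a reference text of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference text
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the study text."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        # Keep only positive keywords: overused relative to the reference.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Hypothetical texts A (study) and B (reference).
text_a = "the corpus shows the corpus data and the corpus tools".split()
text_b = "the cat sat on the mat and the dog sat on the rug".split()
print(keywords(text_a, text_b, top=3))
```

Note that "the" is the joint most frequent word in text A but does not come out as key, because it is just as frequent in text B. This is the difference between significant and merely frequent items.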

Collocations
Collocation = a relationship between words that tend to
occur together in texts
Words that tend to occur near word X are the collocates
of word X (consider “fish and XXXXX”)
Based on frequency: how often the words occur separately vs. how
often they occur together (measured with, e.g., t-score and mutual information)
The company a word keeps: implicit associations or
assumptions (also “semantic prosody”, cf. Sinclair, 2002)
e.g., Bachelor: eligible, flat, life, days
Spinster: elderly, widows, sisters, parish
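
A rough sketch of how t-score and mutual information can be computed for the collocates of a node word. The expected-frequency formula used here (node frequency × collocate frequency × span size ÷ corpus size) is one common variant and is an assumption, as are the function name, the window of two words either side, and the made-up sample text.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2, min_count=1):
    """Return (word, observed, MI, t-score) for words near `node`,
    sorted by t-score from highest to lowest."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Count every token inside the window around this occurrence.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                observed[tokens[j]] += 1
    results = []
    for word, o in observed.items():
        if o < min_count:
            continue
        # Assumed expected co-occurrence if words were distributed at random.
        expected = freq[node] * freq[word] * (2 * window) / n
        mi = math.log2(o / expected)
        t = (o - expected) / math.sqrt(o)
        results.append((word, o, round(mi, 2), round(t, 2)))
    return sorted(results, key=lambda r: r[3], reverse=True)

# Hypothetical sample built around the "fish and ..." example.
tokens = ("we had fish and chips then more fish and chips but "
          "the fish and rice was cold and the chips were gone").split()
for row in collocates(tokens, "fish"):
    print(row)
```

On real corpora a minimum co-occurrence count is normally applied, because mutual information tends to favour rare words while t-score tends to favour frequent ones.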

Our scenario
SEM TAGGER
POS TAGGER

Book Search
Other texts – not compiled for
corpus linguistics

Linguistic analysis
Natural language processing & computational linguistics
Corpus:
-empirical evidence to inform linguist…
-statistical and rule-based language models
Corpus Linguistics
Theories on language / communication
Corpus linguistics for the analysis of real linguistic /
paralinguistic / extra-linguistic information
feedback

space for our own annotation
some mark-up for context
audiovisual element

Task for you:
What corpus steps / resources would you have to
go through in order to…?
-Find the most frequent word in spoken English?
-Identify different collocates of “news” in
written English?
-Determine the most frequent verb tense used in
business letters?
-Determine the key errors that your students make
by comparison with other students’ errors?

CORPORA and LANGUAGE TEACHING /
LEARNING
-Mixture of instructional and naturalistic language learning (LL)
-Fulfilment of both the input and output hypotheses
-“Scaffolding” (though loosely speaking)
-insights concerning English culture(s)
-Student-centred and related to constructivism:
mastering corpora = learning autonomy

CORPUS-BASED ESL/EFL ACTIVITIES
-Focus on lexis, grammar and register
-introductory notions concerning collocation,
colligation, and formal vs. informal
-For already motivated students: corpora on-line

Activity one:
contractions, formal or
informal?
spoken or written?

Activity two:
Corpora as a source of
knowledge concerning
collocation and colligation

Activity three:
meaning via collocations and
co-text

Activity four: Lexical descriptions
**Contact with the English language:
input (at least lexis-wise)

Word lists: author corpus and reference corpus
-Select the text you want a list of
-Save both lists (author corpus list, reference corpus list)
to compare them with Keyword

Some concluding remarks = Task for you:
What corpus steps would you have to take
in order to:
-Demonstrate that the use of some verbs is
incorrect
-Investigate the real lexical levels needed for
different learning stages (e.g., beginner versus
intermediate)
-Determine the main grammatical structures in
conversation that learners may need

Adolphs, Svenja. (2006). Introducing electronic text analysis: A practical
guide for language and literary studies. London: Routledge.
Biber, Douglas, Conrad, Susan, & Reppen, Randi. (1998). Corpus
Linguistics: Investigating language structure and use. Cambridge: CUP.
Hockey, Susan. (2000). Electronic texts in the humanities: Principles and
practice. Oxford: Oxford University Press.
Hunston, Susan. (2002). Corpora in applied linguistics. Cambridge:
Cambridge University Press.
Kennedy, Graeme D. (1998). An introduction to corpus linguistics.
London: Longman.
McEnery, Tony, & Wilson, Andrew. (2001). Corpus linguistics (2nd
ed.). Edinburgh: Edinburgh University Press.
Meyer, Charles. (2002). English corpus linguistics: An introduction.
Cambridge: Cambridge University Press.
Ooi, Vincent B.Y. (1998). Computer Corpus Lexicography. Cambridge
Sampson, Geoffrey & McCarthy, Diana (Eds.). (2004). Corpus
linguistics. London: Continuum.
Scott, Mike, & Tribble, Chris. (2006). Textual patterns: Keyword and
corpus analysis in language education. Amsterdam: Benjamins.
Sinclair, John. (1991). Corpus, Concordance, Collocation. Oxford
UCREL website (many others … 1991 – 1999 / 2000 – 2005 / (…) ?
