Corpus Linguistics

Presented by
Prof. R. R. Borse,
Asst. Prof. & HOD,
English Department,
B. P. Arts, S. M. A. Sci., K. K. C. Comm. College, Chalisgaon
Dist. Jalgaon
Mail – ravindraborse1@gmail.com

Computers and corpus linguistics
• Historically, manual analysis of large bodies of text
(esp. in literary and biblical studies)
– Error-prone, time-consuming, not verifiable
• Computers have introduced
– Reliability, accuracy and replicability
– Increased speed and capacity mean you can do more on a
grander scale
– New tools mean you can do things you might not have
thought of doing

What is a corpus?
• Corpus (pl. corpora) = ‘body’
• Collection of written text or transcribed speech
• Usually but not necessarily purposefully collected
• Usually but not necessarily structured
• Usually but not necessarily annotated
• (Usually stored on and accessible via computer)
• Corpus ~ text archive

DEFINITION
• Corpus linguistics is the study of language based on large
collections of "real life" language use stored in corpora (or
corpuses), computerized databases created for linguistic
research. Also known as corpus-based studies.
• Corpus linguistics is a method of carrying out
linguistic analyses.

• Corpus linguistics thus is the analysis of
naturally occurring language on the basis of
computerized corpora.
• Usually, the analysis is performed with the
help of the computer, i.e. with specialised
software, and takes into account the
frequency of the phenomena investigated.
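
As a rough illustration of frequency-based analysis, the sketch below counts word frequencies in a tiny made-up text. The sample sentences, the `tokenize` helper and its regular expression are assumptions for illustration only; they do not come from any particular corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokenize(text))

# Toy sample, invented for illustration (not taken from a real corpus).
sample = "The corpus is large. The corpus is balanced. The texts are authentic."
freq = word_frequencies(sample)

# Print the three most frequent words with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

With a real corpus the same counting would run over millions of words, which is where the computer becomes indispensable.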

• The availability of computers and machine-readable text
has made it possible to get data quickly and easily and also
to have this data presented in a format suitable for analysis.
• The main task of the corpus linguist is not to
find the data but to analyse it.

Daniel Nkemleke, Humboldt
Kolleg Kamerun, 30/07/2008
Introduction: what is Corpus Linguistics?
• The study of language based on examples of “real life” language use, collected,
stored and processed via computer
• Facilitated by the advent of computer technology (1960s)
• Latin: corpus (body): body of text → any collection
of more than one text, written or spoken
The word "corpus", derived from the Latin word meaning "body", may be used to
refer to any text in written or spoken form.

What is corpus linguistics?
• Not a branch of linguistics, like sociolinguistics,
psycholinguistics, …
• Not a theory of linguistics
• A set of tools and methods (and a philosophy)
to support linguistic investigation across all
branches of the subject

INTRODUCTION TO CORPUS
LINGUISTICS
• Corpus linguistics can be described as the
study of language based on text corpora.
• A corpus is a collection of machine-readable,
authentic texts, chosen to characterize or
represent a state or variety of a language.
• Corpus v. Text archive
• Representativeness

WHY USE CORPORA?
• Authenticity
• Objectivity
• Verifiability
• Exposure to large amounts of data
• New insights into language
• Enhancement of learner motivation

Best known corpora
• The Birmingham Collection of English Texts
(COBUILD)
• The Bank of English
• The British National Corpus (BNC)
• The Brown Corpus
• The Lancaster-Oslo/Bergen Corpus (LOB)
• The Helsinki Corpus of English Texts: Diachronic
and Dialectal
• The International Corpus of English (ICE)

Some (main) existing corpora
L1 Corpora
• Brown Corpus of American English
• Lancaster-Oslo/Bergen Corpus (LOB)
• London-Lund Corpus
• British National Corpus (BNC)
• Birmingham Corpus of British English
L2 Corpora
• ICE-East Africa (Kenya & Tanzania)
• Corpus of Cameroon English
• Corpus of Nigerian English ??
• Kolhapur Corpus of Indian English
Multinational Corpus Project
• International Corpus of English (ICE)

BNC (1995)
• http://www.natcorp.ox.ac.uk/
• 100m word collection of written and spoken
text from 1975-93 (already dated in some
respects!)
• Carefully designed and balanced
• Corpus is closed (finite, synchronic)
• All text tagged to high quality
• Lots of tools available for exploration

Corpus utility
• Possible ways in which a corpus may be useful:
1. Corpora as a source of empirical data
2. Corpora in language teaching and learning
3. Corpora in lexical studies
4. Corpora in grammar studies
5. Corpora in speech research
6. Corpora and semantic studies
7. Corpora in pragmatic and discourse studies
8. Corpora in sociolinguistic studies
9. Corpora and stylistic studies
10. Corpora in historical linguistics
11. Corpora in dialectology and variational studies
12. Corpora in psycholinguistics
13. Corpora in cultural studies

BTANT 129 w5
International Corpus of English
• 20 corpora of 1 million words each, devoted to varieties
of English around the world
• 500 texts (300 spoken, 200 written) of 2,000
words each
• Time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool ICECUP

British National Corpus
• 100 million words, carefully selected
• 10% spoken material
• Time span: from 1960 (fiction) and from 1975 (non-fiction)
• Texts of 40,000-50,000 words
• TEI-compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/

Short history
Brief mention of just a select few!
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples 2000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– GB replica of Brown

Historical background of Corpus
Linguistics
• R. Quirk’s Survey of English Usage (SEU)
• Advent of computers
• First corpora
• The Brown Corpus

Corpus creation
• The design of a corpus is dependent upon
the type of a corpus and purpose for which
the corpus is to be used.
• Types of corpora (sample, monitor, general,
spoken, written, learner, translation, parallel,
comparable, etc).

New insights into language
• Sinclair noted (1991:1) that “traditionally linguistics has been
limited to what a single individual could experience and
remember… Starved of adequate data, linguistics languished –
indeed it became totally introverted. It became fashionable to
look inwards to the mind rather than outwards to society.
Intuition was the key, and similarity of language structure to
various formal models was emphasized. The communicative
role of language was hardly referred to…. Students of
linguistics over many years have been urged to rely heavily on
their intuition and to prefer their intuitions to actual text
where there is some discrepancy. Their study has, therefore,
been more about intuition than about language”.

New insights into language
• Many subtle observations.
• Corpora can help learners discover new
meanings of the words they already know.
• New understanding of meaning in Corpus
Linguistics.

Example -
• Some differences between strong and powerful (source:
British National Corpus):
– strong: wind, feeling, accent, flavour
– powerful: tool, weapon, punch, engine
• The differences are subtle, but examining their collocates
helps.
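
To make the idea of collocates concrete, here is a minimal sketch that counts the words occurring immediately after a given node word. The toy sentences and the `right_collocates` helper are invented for illustration; real figures would have to come from a corpus such as the BNC.

```python
import re
from collections import Counter

def right_collocates(text, node):
    """Count the words that occur immediately after `node` in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    # Walk over adjacent token pairs and record what follows the node word.
    for current, following in zip(tokens, tokens[1:]):
        if current == node:
            counts[following] += 1
    return counts

# Invented sentences for illustration only (not BNC data).
toy_text = (
    "A strong wind blew. She has a strong accent. "
    "It is a powerful tool. He threw a powerful punch. "
    "The strong wind returned."
)

print(right_collocates(toy_text, "strong").most_common())
print(right_collocates(toy_text, "powerful").most_common())
```

Corpus tools usually look at a wider window around the node word and apply statistical association measures; this sketch only counts the adjacent word.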

Example (British National Corpus)
• British National Corpus (BNC):
– 100 million words of English
• 90% written, 10% spoken
– Designed to be representative and balanced.
– Texts from different genres (literature, news,
academic writing…)
– Annotated: Every single word is accompanied by
part-of-speech information.

Example -
• A sentence in the BNC:
– Explosives found on Chalisgaon station.
• <s>
• <w NN2>Explosives
• <w VVD>found
• <w PRP>on
• <w NP0>Chalisgaon
• <w NP0>station
• <PUN>.

Example (continued)
• <s> (new sentence)
• <w NN2>Explosives (plural noun)
• <w VVD>found (past tense verb)
• <w PRP>on (preposition)
• <w NP0>Chalisgaon (proper noun)
• <w NP0>station (noun)
• <PUN>. (punctuation)
Explosives found on Chalisgaon station.
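
The tagged sentence above can be read into (word, tag) pairs with a few lines of code. This is only a sketch: it assumes the simplified markup shown on the slide (not the full BNC file format), and the `parse_tagged` helper and its regular expression are hypothetical.

```python
import re

# Glosses for the tags used in the example, as given on the slide.
TAG_GLOSSES = {
    "NN2": "plural noun",
    "VVD": "past tense verb",
    "PRP": "preposition",
    "NP0": "proper noun",
    "PUN": "punctuation",
}

def parse_tagged(line):
    """Extract (word, tag) pairs from the simplified slide markup.

    Assumes items look like '<w TAG>word' or '<TAG>word'; the sentence
    marker '<s>' carries no word and is skipped.
    """
    return [(word, tag) for tag, word in
            re.findall(r"<(?:w )?([A-Z0-9]+)>([^<\s]+)", line)]

line = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
        "<w NP0>Chalisgaon <w NP0>station <PUN>.")

for word, tag in parse_tagged(line):
    print(f"{word}\t{tag}\t{TAG_GLOSSES.get(tag, '?')}")
```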

One quick example
• Representativity or representativeness
• Throw the two words at Google and have a
look at the figures
• Think about the conclusions
• There are special front-end sites

Concordances : arrive _ NP (Simplification)
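
A concordance lists every occurrence of a search word together with its surrounding context. The sketch below builds a very simple keyword-in-context (KWIC) display for "arrive"; the toy text and the `kwic` helper are invented for illustration and do not reproduce the concordance shown on the original slide.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines with `width` words either side of each hit."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{token}] {right}")
    return lines

# Invented sentences for illustration only.
toy_text = ("The guests arrive tomorrow morning. Trains arrive late in winter. "
            "When we arrive, call me.")

for line in kwic(toy_text, "arrive"):
    print(line)
```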

Important to note
• This is not “raw” text.
– Annotation means we can search for particular patterns.
– E.g. for the quiver/quake study: “find all occurrences of quiver
which are verbs, followed by a determiner and a noun”
• The collection is very large
– Only in very large collections are we likely to find rare
occurrences.
• Corpus search is done by computer. You can’t trawl
through 100 million words manually!
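
A pattern search such as "a verb form of quiver followed by a determiner and a noun" can be sketched over (word, tag) pairs as follows. The sentence is invented, and the tag names (AT0, DT0, NN1, NN2, VVD and so on) are assumed here to follow the BNC-style tags used in the earlier example; `find_verb_det_noun` is a hypothetical helper, not part of any corpus tool.

```python
def find_verb_det_noun(tagged, verb_forms):
    """Return (verb, determiner, noun) triples where a verb from `verb_forms`
    is directly followed by a determiner and then a noun."""
    hits = []
    # Slide a window of three tokens across the tagged sentence.
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        is_verb = w1.lower() in verb_forms and t1.startswith("VV")
        is_det = t2 in {"AT0", "DT0"}
        is_noun = t3.startswith("NN")
        if is_verb and is_det and is_noun:
            hits.append((w1, w2, w3))
    return hits

# Invented, hand-tagged sentence for illustration only.
tagged_sentence = [
    ("The", "AT0"), ("bird", "NN1"), ("quivered", "VVD"), ("the", "AT0"),
    ("feathers", "NN2"), ("and", "CJC"), ("felt", "VVD"), ("a", "AT0"),
    ("quiver", "NN1"), ("of", "PRF"), ("fear", "NN1"), (".", "PUN"),
]

quiver_forms = {"quiver", "quivers", "quivered", "quivering"}
print(find_verb_det_noun(tagged_sentence, quiver_forms))
```

Note that the noun "quiver" in "a quiver of fear" is not returned, because its tag marks it as a noun; a search on raw text alone could not make that distinction.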
