Presented by
Prof. R. R. Borse,
Asst. Prof. & HOD,
English Department,
B.P. Arts, S.M.A. Sci., K.K.C. Comm. College, Chalisgaon
Dist. Jalgaon
Mail: ravindraborse1@gmail.com
Computers and corpus linguistics
• Historically, manual analysis of large bodies of text
(esp. in literary and biblical studies)
– Error-prone, time-consuming, not verifiable
• Computers have introduced
– Reliability, accuracy and replicability
– Increased speed and capacity mean you can do more on a
grander scale
– new tools mean you can do things you might not have
thought of doing
What is a corpus?
• Corpus (pl. corpora) = ‘body’
• Collection of written text or transcribed speech
• Usually but not necessarily purposefully collected
• Usually but not necessarily structured
• Usually but not necessarily annotated
• (Usually stored on and accessible via computer)
• Corpus ~ text archive
DEFINITION
• Corpus linguistics is the study
of language based on large collections of "real
life" language use stored
in corpora (or corpuses): computerized
databases created for linguistic research. Also
known as corpus-based studies.
• Corpus linguistics is a method of carrying out
linguistic analyses.
• Corpus linguistics thus is the analysis of
naturally occurring language on the basis of
computerized corpora.
• Usually, the analysis is performed with the
help of the computer, i.e. with specialised
software, and takes into account the
frequency of the phenomena investigated.
• The availability of computers and machine-
readable text has made it possible to get data
quickly and easily and also to have this data
presented in a format suitable for analysis.
• The main task of the corpus linguist is not to
find the data but to analyse it.
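The point about frequency can be sketched in a few lines of Python. This is only a minimal illustration, assuming a plain-text sample and a very crude tokenizer. The function name `word_frequencies` and the sample sentence are invented for this sketch and do not come from any corpus tool.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word tokens in a text (crude tokenizer, for illustration)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy text, invented for illustration only; a real study would load corpus files.
sample = "The cat sat on the mat. The dog sat on the log."
freq = word_frequencies(sample)

# List the most frequent word forms with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

Dedicated corpus software does the same counting at scale and adds proper tokenization and normalisation.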
Daniel Nkemleke, Humboldt
Kolleg Kamerun, 30/07/2008
Introduction: what is Corpus Linguistics?
• The study of language based on examples of “real life” language use, collected,
stored and processed via computer
• Facilitated by the advent of computer technology (1960s)
• Latin: corpus (body): body of text → any collection
of more than one text, written or spoken
The word "corpus", derived from the Latin word meaning "body", may be used to
refer to any text in written or spoken form.
What is corpus linguistics?
• Not a branch of linguistics, like socio~,
psycho~, …
• Not a theory of linguistics
• A set of tools and methods (and a philosophy)
to support linguistic investigation across all
branches of the subject
INTRODUCTION TO CORPUS
LINGUISTICS
• Corpus linguistics can be described as the
study of language based on text corpora.
• A corpus is a collection of machine-readable,
authentic texts, chosen to characterize or
represent a state or variety of a language.
• Corpus v. Text archive
• Representativeness
WHY USE CORPORA?
• Authenticity
• Objectivity
• Verifiability
• Exposure to large amounts of data
• New insights into language
• Enhancement of learner motivation
Best known corpora
• The Birmingham Collection of English Texts
(COBUILD)
• The Bank of English
• The British National Corpus (BNC)
• The Brown Corpus
• The Lancaster-Oslo/Bergen Corpus (LOB)
• The Helsinki Corpus of English Texts: Diachronic
and Dialectal
• The International Corpus of English (ICE)
Some (main) existing corpora
L1 Corpora
• Brown Corpus of American English
• Lancaster-Oslo/Bergen Corpus (LOB)
• London-Lund Corpus
• British National Corpus (BNC)
• Birmingham Corpus of British English
L2 Corpora
• ICE-East Africa (Kenya & Tanzania)
• Corpus of Cameroon English
• Corpus of Nigerian English ??
• Kolhapur Corpus of Indian English
Multinational Corpus Project
• International Corpus of English (ICE)
BTANT 129 w5
BNC (1995)
• http://www.natcorp.ox.ac.uk/
• 100m word collection of written and spoken
text from 1975-93 (already dated in some
respects!)
• Carefully designed and balanced
• Corpus is closed (finite, synchronic)
• All text tagged to high quality
• Lots of tools available for exploration
Corpus utility
• possible ways in which a corpus may be useful
1. Corpora as a source of empirical data
2. Corpora in language teaching and learning
3. Corpora in lexical studies
4. Corpora in grammar studies
5. Corpora in speech research
6. Corpora and semantic studies
7. Corpora in pragmatic and discourse studies
8. Corpora in sociolinguistic studies
9. Corpora and stylistic studies
10. Corpora in historical linguistics
11. Corpora in dialectology and variational studies
12. Corpora in psycholinguistics
13. Corpora in cultural studies
International Corpus of English
• 20 corpora of 1 m words devoted to varieties
of English around the world
• 500 texts (300 spoken, 200 written) of 2000
words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool ICECUP
British National Corpus
• 100 m words, carefully selected
• 10 % spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• 40,000-50,000-word texts
• TEI compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/
Short history
Brief mention of just a select few!
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of 2000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– GB replica of Brown
Historical background of Corpus
Linguistics
• R. Quirk’s Survey of English Usage (SEU)
• Advent of computers
• First corpora
• The Brown Corpus
Corpus creation
• The design of a corpus depends on
the type of corpus and the purpose for which
the corpus is to be used.
• Types of corpora (sample, monitor, general,
spoken, written, learner, translation, parallel,
comparable, etc).
New insights into language
• Sinclair noted (1991:1) that “traditionally linguistics has been
limited to what a single individual could experience and
remember… Starved of adequate data, linguistics languished –
indeed it became totally introverted. It became fashionable to
look inwards to the mind rather than outwards to society.
Intuition was the key, and similarity of language structure to
various formal models was emphasized. The communicative
role of language was hardly referred to…. Students of
linguistics over many years have been urged to rely heavily on
their intuition and to prefer their intuitions to actual text
where there is some discrepancy. Their study has, therefore,
been more about intuition than about language”.
New insights into language
• Many subtle observations.
• Corpora can help learners discover new
meanings of the words they already know.
• New understanding of meaning in Corpus
Linguistics.
Example
• Some differences between strong and powerful (source:
British National Corpus):
– strong: wind, feeling, accent, flavour
– powerful: tool, weapon, punch, engine
• The differences are subtle, but examining their collocates
helps.
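How collocates might be collected can be sketched as follows. This is a minimal illustration, assuming that a "collocate" simply means the word immediately to the right of the node word. The helper `right_collocates` and the toy text are invented here. The counts say nothing about the BNC itself.

```python
from collections import Counter

def right_collocates(tokens, node):
    """Count the words that occur immediately after `node`."""
    return Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node
    )

# Invented toy text; a real study would query the BNC or another corpus.
text = ("a strong wind and a strong accent met a powerful engine "
        "a powerful weapon and a strong wind")
tokens = text.split()

strong = right_collocates(tokens, "strong")
powerful = right_collocates(tokens, "powerful")

print("strong:", strong.most_common())
print("powerful:", powerful.most_common())
```

Real collocation studies usually use a wider window around the node word and a statistical association measure rather than raw counts.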
Example (British National Corpus)
• British National Corpus (BNC):
– 100 million words of English
• 90% written, 10% spoken
– Designed to be representative and balanced.
– Texts from different genres (literature, news,
academic writing…)
– Annotated: Every single word is accompanied by
part-of-speech information.
Example -
• A sentence in the BNC:
– Explosives found on Chalisgaon station.
• <s>
• <w NN2>Explosives
• <w VVD>found
• <w PRP>on
• <w NP0>Chalisgaon
• <w NP0>station
• <PUN>.
Example (continued)
Explosives found on Chalisgaon station.
• <s> = new sentence
• <w NN2>Explosives = plural noun
• <w VVD>found = past tense verb
• <w PRP>on = preposition
• <w NP0>Chalisgaon = proper noun
• <w NP0>station = noun
• <PUN>. = punctuation
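The tagged sentence in this example can be read by a program as well as by a person. The sketch below assumes the simplified notation used on the slide (`<w TAG>word` and `<PUN>mark`); the distributed BNC files use fuller markup, so this is not a BNC reader. The names `TOKEN` and `parse_tagged` are invented for illustration.

```python
import re

# Matches the simplified slide notation only: <w TAG>word or <PUN>mark.
# The sentence marker <s> is lower-case and is deliberately not matched.
TOKEN = re.compile(r"<(?:w\s+)?([A-Z0-9]+)>([^<\s]+)")

def parse_tagged(line):
    """Return (tag, word) pairs from a slide-style tagged sentence."""
    return TOKEN.findall(line)

tagged = ("<s><w NN2>Explosives <w VVD>found <w PRP>on "
          "<w NP0>Chalisgaon <w NP0>station <PUN>.")
pairs = parse_tagged(tagged)

# Print one token per line, word first and tag second.
for tag, word in pairs:
    print(f"{word}\t{tag}")
```

Once the text is in (tag, word) pairs, searches can refer to word classes instead of only to spellings.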
One quick example
• Representativity or representativeness
• Throw the two words at Google and have a
look at the figures
• Think about the conclusions
• There are special front-end sites
Concordances: arrive _ NP (simplification)
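A concordance lists every occurrence of a search word together with its immediate context (keyword in context, KWIC). The sketch below is a minimal version, assuming whitespace-separated tokens. The function `kwic`, its `width` parameter and the two sample sentences are invented for illustration and are not taken from the original slide.

```python
def kwic(tokens, node, width=3):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

# Invented sample text, already split into tokens by spaces.
text = "They arrive home late . We arrive tomorrow at noon ."
for line in kwic(text.split(), "arrive"):
    print(line)
```

Aligning the keyword in a column makes patterns to its right, such as a following noun phrase, easier to spot.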
Important to note
• This is not “raw” text.
– Annotation means we can search for particular patterns.
– E.g. for the quiver/quake study: “find all occurrences of quiver
which are verbs, followed by a determiner and a noun”
• The collection is very large
– Only in very large collections are we likely to find rare
occurrences.
• Corpus search is done by computer. You can’t trawl
through 100 million words manually!
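The quiver query can be sketched over part-of-speech tagged tokens. This is only an illustration: it assumes BNC-style tags in which lexical verbs begin with VV, nouns begin with NN, and determiners are AT0, DT0 or DPS. The function `verb_det_noun` and the two toy sentences are invented here.

```python
def verb_det_noun(tagged, verb_forms):
    """Find verb + determiner + noun sequences whose verb is in `verb_forms`."""
    hits = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        is_verb = t1.startswith("VV") and w1.lower() in verb_forms
        is_det = t2 in {"AT0", "DT0", "DPS"}  # assumed determiner tags
        is_noun = t3.startswith("NN")
        if is_verb and is_det and is_noun:
            hits.append((w1, w2, w3))
    return hits

# Invented toy data as (word, tag) pairs; not taken from the BNC.
tagged = [
    ("The", "AT0"), ("bird", "NN1"), ("quivered", "VVD"),
    ("its", "DPS"), ("wings", "NN2"), (".", "PUN"),
    ("A", "AT0"), ("quiver", "NN1"), ("of", "PRF"),
    ("fear", "NN1"), (".", "PUN"),
]
forms = {"quiver", "quivers", "quivered", "quivering"}

print(verb_det_noun(tagged, forms))
```

The noun use of quiver in the second toy sentence is skipped because its tag is not a verb tag, which is exactly what raw text search could not do.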
Corpus Linguistics
