Dr. VMS

Corpus Linguistics
• Linguistics is the scientific study of language and its
structure; ‘corpus linguistics’ is the study of language
“on the basis of text corpora.”
• The analysis does not stop at describing those texts;
it also focuses on their contexts.
• A corpus is a collection of machine-readable, authentic
texts, chosen to characterize or represent a state or
variety of a language.
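
The focus on context can be illustrated with a keyword-in-context (KWIC) concordance, which lists each occurrence of a word together with its neighbouring words. The sketch below is only a minimal illustration: the function name `kwic`, the simple regular-expression tokenizer, and the sample sentence are assumptions made for this example and do not come from any particular corpus tool.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented sample text, used only to show the output layout.
sample = "The bank raised interest rates. We walked along the river bank at dusk."
for left, kw, right in kwic(sample, "bank"):
    print(f"{left:>20} | {kw} | {right}")
```

Reading the two hits side by side shows how the surrounding words separate the financial sense of "bank" from the riverside sense, which is the kind of evidence a concordance is meant to surface.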

Purpose of Corpus Linguistics
• For linguistic research
  - Allows more effective corpus searches
• For natural language processing
  - Spelling and grammar checking
  - Machine translation
  - Question answering
• Descriptions of language for use in a variety of
applications, such as language learning and teaching and
natural language processing by machine, including speech
recognition and translation.

Scope of Corpus
• Authenticity
• Objectivity
• Verifiability
• Exposure to large amounts of data
• New insights into language
• Corpora can help learners discover new meanings of the
words they already know.
• The Longman Learners’ Corpus contains ten million words of
text written by learners of English of different levels of
proficiency and from twenty different L1 backgrounds.
• The Cambridge Learner Corpus is a large collection of written
texts from learners of English all over the world.

Corpus Linguistics
• A means to explore actual patterns of language use.
• A tool for developing materials for classroom
language instruction.
• To explore different questions about language use.
• To provide powerful tools for the analysis of natural
languages.
• To give insight into how language use varies in
different situations.

What is a corpus?
• Leech (1992): an unexciting phenomenon, a helluva
lot of text, stored on a computer
• Francis (1982): a collection of texts assumed to be
representative of a given language, dialect, or other
subset of a language to be used for linguistic analysis
• Sinclair (1991): a collection of naturally-occurring
language text, chosen to characterise a state or a
variety of language

Types of corpora
• General-purpose vs. specialized corpora
  - The British National Corpus
  - Michigan Corpus of Academic Spoken English
• Native vs. learner corpora
  - International Corpus of Learner English
• Monolingual vs. parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
• Corpora representing one or diverse language varieties
  - International Corpus of English
• Synchronic vs. diachronic corpora
• Spoken vs. written corpora

General Corpora
• Texts that do not belong to a single text type,
subject field, or register.
• May include written or spoken language, or both.
• May include texts produced in one country or many.
• They aim to represent language in its broadest sense
and to serve as a widely available resource for
baseline or comparative studies of general linguistic
features.

Reference Corpora
• Used to produce reference materials for language
learning or translation.
• Often used as a baseline in comparison with more
specialized corpora.
• General corpora are also sometimes known as
‘reference corpora’.
• Brown Corpus – 1 million words.
• LOB Corpus – 1 million words.
• BNC (British National Corpus) – 100 million words.
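
Because reference corpora differ greatly in size (1 million words versus 100 million), raw counts cannot be compared directly when one corpus serves as a baseline for another. A common workaround is to normalize counts to a rate per million words. The sketch below assumes a naive tokenizer and uses two invented miniature "corpora"; the function name `freq_per_million` is hypothetical and chosen only for this illustration.

```python
import re
from collections import Counter


def freq_per_million(word, text):
    """Occurrences of `word` per million tokens of `text`."""
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * 1_000_000 / len(tokens)


# Invented stand-ins for a general (baseline) and a specialized corpus.
reference_text = "the cat sat on the mat and the dog sat by the door"
specialized_text = "the patient received the dose and the dose was repeated"

for word in ("the", "dose"):
    base = freq_per_million(word, reference_text)
    spec = freq_per_million(word, specialized_text)
    print(f"{word}: baseline {base:.0f} vs. specialized {spec:.0f} per million")
```

A word such as "dose" that is far more frequent in the specialized sample than in the baseline is a candidate marker of that register, whereas a function word such as "the" occurs at a broadly similar rate in both.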

Specialized Corpora
• Texts that are designed with more specific research
goals in mind – register-specific descriptions and
investigations of language.
• They aim to be representative of a given type of text.
• Used to investigate a particular type of language.

Historical or Diachronic Corpora
• Texts from different periods of time.
• Aim to represent one or more earlier stages of a language.
• They help to trace the development of a language
over time.
• Helsinki Corpus – texts from about 700 to 1700,
1.5 million words.
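
Tracing development over time usually means grouping dated texts into periods and comparing how often a form occurs in each. The sketch below is a minimal illustration under stated assumptions: the dated sample lines are invented (they are not taken from the Helsinki Corpus), the tokenizer is naive, and `freq_by_period` is a hypothetical helper name.

```python
import re
from collections import defaultdict


def freq_by_period(dated_texts, word, span=100):
    """Relative frequency of `word` per period of `span` years."""
    totals = defaultdict(int)  # tokens seen in each period
    hits = defaultdict(int)    # occurrences of `word` in each period
    for year, text in dated_texts:
        period = (year // span) * span
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[period] += len(tokens)
        hits[period] += tokens.count(word)
    return {p: hits[p] / totals[p] for p in sorted(totals) if totals[p]}


# Invented, dated sample lines used only to exercise the function.
dated_texts = [
    (1450, "thou art kind and thou art true"),
    (1480, "thou shalt see"),
    (1650, "you are kind and thou art late"),
]
for period, rate in freq_by_period(dated_texts, "thou").items():
    print(period, round(rate, 3))
```

With real data, each period would need enough text for the rates to be meaningful; the tiny samples here only show the mechanics of grouping and normalizing.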

Regional Corpora
• Aim to represent a regional variety of a language,
such as a dialect.

