Section III
Pages 105 to 150
Presented by
Ata ul Ghafer & Shoiba Sabir
Department of Applied Linguistics
GCUF
Chapter 9
What corpora are available?
by David Y. W.
Outline
• What are corpora?
• Types of corpora
1. General corpora
2. Specialized corpora
3. Speech corpora
4. Parsed corpora
5. Historical corpora
6. Multimedia corpora
7. Developmental, learner and lingua franca corpora
8. ESL/EFL learner corpora
9. Parallel corpora
10. Comparable corpora
11. Multilingual corpora
What are corpora?
• "Corpus" (plural "corpora") is a Latin word meaning "body" or "mass".
• A collection of written texts, especially the entire works
of a particular author or a body of writing on a
particular subject, e.g. "the Darwinian corpus".
• Corpora are large, structured sets of texts
(nowadays usually stored and processed electronically).
• They are used for statistical analysis and hypothesis
testing, for checking occurrences, and for validating
linguistic rules within a specific language territory.
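"Checking occurrences" can be illustrated with a minimal Python sketch. The three sentences below are invented stand-ins for a real corpus, and the `tokenize` helper is a hypothetical, deliberately simple tokenizer; real corpus tools handle punctuation, hyphenation and annotation far more carefully.

```python
import re
from collections import Counter

# Toy corpus: three invented "texts" standing in for a real collection.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A cat and a dog can be friends.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Count every token across the whole corpus.
frequencies = Counter(token for text in corpus for token in tokenize(text))

print(frequencies["cat"])          # occurrences of one word
print(frequencies.most_common(2))  # the two most frequent words
```

Counting raw occurrences like this is the starting point for the statistical analysis mentioned above; comparisons between corpora of different sizes normally use relative rather than raw frequencies.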
Types of corpora
General Corpora
• Texts that do not belong to a single text type,
subject field, or register.
• May include written or spoken language, or both.
• May include texts produced in one country or
many.
• They aim to represent language in its broadest
sense and to serve as a widely available resource
for baseline or comparative studies of general
linguistic features.
Examples
• Brown Corpus – 1 million words.
• LOB Corpus – 1 million words.
• BNC (British National Corpus) – 100 million words.
Specialized Corpora
• Corpora designed with more specific
research goals in mind, such as register-specific
descriptions and investigations of language.
• They aim to be representative of a given type
of text.
• Used to investigate a particular type of
language.
• The kinds of text included are restricted, for example by:
• A time frame – such as a particular century.
• A social setting – such as conversations
taking place in a bookshop.
• A given topic – such as newspaper articles
dealing with a particular subject.
Examples
• Cambridge and Nottingham Corpus of
Discourse in English (CANCODE) (informal
registers of British English) – 5 million
words.
• Michigan Corpus of Academic Spoken
English (MICASE) (spoken registers in a
US academic setting) – about 1.8 million words.
Historical (Diachronic) Corpora
• Texts from different periods of time.
• They aim to represent one or more earlier stages of a
language and help to trace the development of a
language over time.
• Example:
Helsinki Corpus – texts from about 700 to 1700,
1.5 million words.
Speech corpora:
- sound recordings
- detailed description of spoken phenomena: phonology,
prosody (stress, tone units…), etc.
- example: Spoken English Corpus
Multimedia corpora:
- transcripts synchronised with audio/video recordings
- example: Santa Barbara Corpus of Spoken American English
(SBCSAE), available on the TalkBank website
Learner Corpora
• Aim to represent the language as produced by
learners of a language; they include spoken or
written samples produced by non-native speakers.
• They are used to identify differences among learners
in word frequency and in types of error.
• They show in what respects learners differ from each
other and from native speakers.
Examples
• Louvain Corpus of Native English Essays (LOCNESS), a
native-speaker corpus used for comparison.
• International Corpus of Learner English (ICLE) –
20,000 words.
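A learner/native comparison of word frequency might be sketched as follows. Both sample texts are invented for illustration and are not taken from ICLE or any other corpus; `per_thousand` is a hypothetical helper that normalises counts so that samples of different sizes can be compared.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_thousand(word, text):
    """Relative frequency of `word` per 1,000 tokens of `text`."""
    tokens = tokenize(text)
    if not tokens:  # avoid dividing by zero on an empty sample
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Invented samples standing in for a learner corpus and a native corpus.
learner_sample = "I think it is very good. I think people very like it."
native_sample = "It seems to be a good idea, and people generally like it."

for word in ("think", "very"):
    print(word,
          round(per_thousand(word, learner_sample), 1),
          round(per_thousand(word, native_sample), 1))
```

A word that is markedly more frequent in the learner sample than in the native sample is a candidate for overuse; with real corpora the difference would also need a significance test.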
Multilingual Corpora
• Any systematic collection of empirical language
data enabling linguists to carry out analyses of
multilingual individuals, multilingual societies or
multilingual communication.
Comparable Corpora
• Two (or more) corpora in different languages (e.g.
English and Spanish) or in different varieties of a
language (e.g. Indian English and Canadian English).
They are designed along the same lines, e.g. they contain
the same proportions of newspaper texts, novels, casual
conversation, etc.
• Comparable corpora of varieties of the same language
can be used to compare those varieties.
• Comparable corpora of different languages can be used
by translators to identify differences and equivalences
in each language.
• Example: the International Corpus of English (ICE)
comprises comparable corpora of 1 million words each,
covering different varieties of English.
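The idea of corpora "designed along the same lines" can be checked with a small sketch. The genre labels and word counts below are invented, and `same_design` with its 1% tolerance is a hypothetical helper, not part of any corpus toolkit.

```python
def proportions(word_counts):
    """Share of the corpus taken up by each genre."""
    total = sum(word_counts.values())
    return {genre: count / total for genre, count in word_counts.items()}

def same_design(a, b, tolerance=0.01):
    """True if both corpora cover the same genres in similar proportions."""
    pa, pb = proportions(a), proportions(b)
    return pa.keys() == pb.keys() and all(
        abs(pa[genre] - pb[genre]) <= tolerance for genre in pa
    )

# Invented word counts per genre for two varieties of English.
indian = {"newspaper": 400_000, "fiction": 300_000, "conversation": 300_000}
canadian = {"newspaper": 200_000, "fiction": 150_000, "conversation": 150_000}

print(proportions(indian))
print(same_design(indian, canadian))
```

The two corpora differ in size but share the same genre proportions, which is what makes them comparable in this sense.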
Parallel Corpora
• Two (or more) corpora in different
languages, each containing texts that
have been translated from one language
into the other, or texts that have been
produced simultaneously in two or more
languages.
• Can be used by translators and by
learners to find potential equivalent
expressions in each language and to
investigate differences between
languages.
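Finding potential equivalents in a parallel corpus can be sketched as a lookup over sentence-aligned pairs. The English/Spanish pairs below are invented for illustration, and `find_equivalents` is a hypothetical helper that uses plain substring matching, which is an assumption and far cruder than real alignment tools.

```python
# Invented sentence-aligned pairs: (English source, Spanish translation).
parallel = [
    ("Good morning.", "Buenos días."),
    ("Thank you very much.", "Muchas gracias."),
    ("Good night.", "Buenas noches."),
]

def find_equivalents(term, pairs):
    """Return the translations of every source sentence containing `term`."""
    term = term.lower()
    return [target for source, target in pairs if term in source.lower()]

print(find_equivalents("good", parallel))
```

The translator or learner still has to read the returned sentences to see which expression corresponds to the search term; here "good" surfaces as both "Buenos" and "Buenas".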
Parsed corpora:
- syntactically analysed
- example: Surface and Underlying Structural Analyses of
Naturalistic English (SUSANNE) Corpus
Developmental language corpora:
- output of non-adult native speakers of English
- language less proficient than in adult native-speaker corpora
- example: Polytechnic of Wales (POW) Corpus
ESL/EFL learner corpora:
- output of learners of English
- learners share one L1 background or have different mother
tongues
- example: Japanese EFL Learner Corpus (JEFLL)
