What corpora are available? by David Y. W.
1. Section III
Page 105 to 150
Presented
by
Ata ul Ghafer & Shoiba Sabir
Department of Applied Linguistics
GCUF
2. Chapter 9
What corpora are available?
by David Y. W.
Outline
What are corpora
Types of corpora
1. General corpora
2. Specialized corpora
3. Speech corpora
4. Parsed corpora
5. Historical corpora
3. Chapter 9
What corpora are available?
by David Y. W.
Outline
1. Multimedia corpora
2. Developmental, learner and lingua franca corpora
3. ESL/EFL learner corpora
4. Parallel corpora
5. Comparable corpora
6. Multilingual corpora
4. What are corpora
From the Latin word "corpus", meaning "body" or "mass".
A collection of written texts, especially the entire works
of a particular author or a body of writing on a
particular subject: "the Darwinian corpus".
Corpora are large, structured sets of texts
(nowadays usually electronically stored and processed).
They are used to do statistical analysis and hypothesis
testing, checking occurrences or validating linguistic
rules within a specific language territory.
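"Checking occurrences" can be illustrated with a very small frequency count. The following is a minimal sketch only: the three sentences and the `tokenize` helper are hypothetical stand-ins, not part of any real corpus or corpus tool.

```python
import re
from collections import Counter

# Hypothetical three-sentence "corpus"; a real corpus holds far more text.
texts = [
    "The corpus is a body of texts.",
    "A corpus is stored electronically.",
    "Linguists test hypotheses against the corpus.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Flatten all texts into one token list, then count each word form.
tokens = [tok for text in texts for tok in tokenize(text)]
freq = Counter(tokens)

print(len(tokens))     # total number of tokens
print(freq["corpus"])  # occurrences of the word "corpus"
```

Real corpus work normally relies on dedicated concordancing software and much larger samples; the sketch only shows the counting step.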
5. Types of corpora
General Corpora
Texts that do not belong to a single text type,
subject field, or register.
May include written or spoken language, or both.
May include texts produced in one country or
many.
They aim to represent language in its broadest
sense and to serve as a widely available resource
for baseline or comparative studies of general
linguistic features.
6. Examples
Brown Corpus – 1 million words.
LOB Corpus – 1 million words.
BNC (British National Corpus) – 100 million words.
7. Specialized Corpora
Corpora designed with more specific
research goals in mind – register-specific
descriptions and investigations of language.
They aim to be representative of a given type
of text.
Used to investigate a particular type of
language.
The kinds of texts included are limited by:
A time frame – such as a particular century.
A social setting – such as conversations
taking place in a bookshop.
A given topic – such as newspaper articles
dealing with a particular subject.
8. Examples
Cambridge and Nottingham Corpus of
Discourse in English (CANCODE) (informal
registers of British English) – 5 million
words.
Michigan Corpus of Academic Spoken
English (MICASE) (spoken registers in a
US academic setting) – approximately 1.8 million words.
9. Historical Corpora
Texts from different periods of time.
They aim to represent earlier stages of a
language and help to trace the development of a
language over time.
Example:
Helsinki Corpus – texts dating from about 700 to 1700;
about 1.5 million words
10. Speech corpora
Sound recordings
-SPOKEN ENGLISH CORPUS
-detailed description of spoken phenomena: phonology,
prosody (stress, tone units…), etc.
Multimedia corpora:
-transcripts synchronised with audio/video recordings
-TALKBANK website: SANTA BARBARA CORPUS OF
SPOKEN AMERICAN ENGLISH (SBCSAE)
11. Learner Corpora
Aim at representing the language as produced by the
learners of a language, and they include spoken or
written language samples produced by non-native
speakers.
They are used to identify differences among learners in
word frequency and types of mistakes, and to show
in what respects learners differ from each other and
from native speakers of the language.
12. Example
Louvain Corpus of Native English Essays (LOCNESS)
International Corpus of Learner English (ICLE) –
roughly 200,000 words per national subcorpus.
13. Multilingual Corpora
Any systematic collection of empirical language
data enabling linguists to carry out analyses of
multilingual individuals, multilingual societies or
multilingual communication.
14. Comparable Corpora
Two (or more) corpora in different languages (e.g.
English and Spanish) or in different varieties of a
language (e.g. Indian English and Canadian English).
They are designed along the same lines – will contain
the same proportions of newspaper texts, novels, casual
conversation, etc.
Comparable corpora of varieties of the same language
can be used to compare those varieties.
Comparable corpora of different languages can be used
by translators to identify differences and equivalences
in each language.
Example: the International Corpus of English (ICE) consists of
comparable corpora of 1 million words each, covering different
varieties of English.
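When the corpora being compared are not the same size, raw counts are usually converted to a rate per million words before comparison. The sketch below assumes made-up counts and sizes; `per_million`, `corpus_a` and `corpus_b` are hypothetical names used only for illustration.

```python
def per_million(count, corpus_size):
    """Convert a raw count into a frequency per million words."""
    return count * 1_000_000 / corpus_size

# Hypothetical counts of one word in two corpora of different sizes.
raw = {
    "corpus_a": {"count": 120, "size": 1_000_000},
    "corpus_b": {"count": 9_000, "size": 100_000_000},
}

# Normalised frequencies can be compared directly across corpora.
normalized = {
    name: per_million(data["count"], data["size"])
    for name, data in raw.items()
}

for name, rate in normalized.items():
    print(name, rate)
```

With corpora built to the same size and design, as in ICE, raw counts are already directly comparable; normalisation matters when sizes differ.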
15. Parallel Corpora
Two (or more) corpora in different
languages, each containing texts that
have been translated from one language
into the other, or texts that have been
produced simultaneously in two or more
languages.
Can be used by translators and by
learners to find potential equivalent
expressions in each language and to
investigate differences between
languages.
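Finding potential equivalent expressions can be sketched as a lookup over sentence-aligned pairs. This is an illustration under stated assumptions: the English-Spanish pairs are invented, and `find_equivalents` is a hypothetical helper, not a function of any real alignment tool.

```python
# Hypothetical sentence-aligned English-Spanish pairs.
aligned = [
    ("Good morning.", "Buenos días."),
    ("Thank you very much.", "Muchas gracias."),
    ("Good night.", "Buenas noches."),
]

def find_equivalents(term, pairs):
    """Return target sentences aligned with source sentences containing term."""
    term = term.lower()
    return [target for source, target in pairs if term in source.lower()]

# Each match shows how the source expression was rendered in translation.
matches = find_equivalents("good", aligned)
for sentence in matches:
    print(sentence)
```

Reading the matched sentences side by side lets a translator or learner see that one source word may have more than one rendering in the other language.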
16. Parsed corpora:
-syntactically analysed
-SURFACE AND UNDERLYING STRUCTURAL ANALYSES OF
NATURALISTIC ENGLISH CORPUS (SUSANNE)
Developmental language corpora:
-non-adult English native speakers' output
-output not as proficient as that of adult native speakers
-POLYTECHNIC OF WALES (POW) CORPUS
17. ESL/EFL learner corpora:
-output of learners of English
-learners sharing one L1 background or having different mother
tongues
-JAPANESE EFL LEARNER CORPUS (JEFLL)