Corpus linguistics

 Group Members:
 Ayesha Azhar
 Bareera Akbar
 Irum Masood
 Maryam Ahmed
 Tahira Jabeen

Incomprehensible consciously
A sea of words
Essence of human beings

 From the Latin word for “body” or “mass”.
 A collection of written texts, especially the entire
works of a particular author or a body of writing
on a particular subject: “the Darwinian corpus”.
 Plural: corpora.

History of Corpus Linguistics
 Language study is not a new idea.
 1921: 30,000 words. A Treasure, but of no use.
 1960s: with the advent of the computer....
 The use of collections of COMPUTER-READABLE text for
language study.
 Brown Corpus of Standard American English.
 One million words of American English texts printed in 1961.
 First electronic corpus

Corpus Linguistics
 Linguistics being the scientific study of language
and its structure, ‘corpus linguistics’ is the study
of language “on the basis of text corpora.”
 The analysis does not stop at describing those
texts; it also examines the contexts in which
they occur.

Place for Corpus Linguistics in Applied Linguistics
 A means to explore actual patterns of language use.
 A tool for developing materials for classroom language
instruction.
 To explore different questions about language use.
 To provide powerful tools for analysis of natural
languages.
 To give an insight about how language use varies in
different situations.

Corpora
 ‘Corpora’ are large, structured sets of texts
(nowadays usually stored and processed
electronically).
 They are used for statistical analysis and
hypothesis testing, checking occurrences, or
validating linguistic rules within a specific
language territory.

General Corpora
 Contain texts that do not belong to a single text type,
subject field, or register.
 May include written or spoken language, or both.
 May include texts produced in one country or
many.
 They aim to represent language in its broadest
sense and to serve as a widely available resource
for baseline or comparative studies of general
linguistic features.

 May be used to produce reference materials for
language learning or translation.
 Often used as a baseline in comparison with more
specialized corpora.
 Also sometimes known as ‘reference corpora’.

Examples
 Brown Corpus – 1 million words.
 LOB Corpus – 1 million words.
 BNC (British National Corpus) – 100 million
words.

Specialized Corpora
 Designed with more specific research goals in mind,
such as register-specific descriptions and
investigations of language.
 Aim to be representative of a given type of text.
 Used to investigate a particular type of language.
 The kind of texts included are limited:
 A time frame – such as a particular century.
 A social setting – such as conversations taking place in
a bookshop.
 A given topic – such as newspaper articles dealing
with a particular thing.

Examples
 Cambridge and Nottingham Corpus
of Discourse in English
(CANCODE) (informal registers of
British English) – 5 million words.
 Michigan Corpus of Academic
Spoken English (MICASE) (spoken
registers in a US academic setting) –
about 1.8 million words.

Historical or Diachronic Corpora
Texts from different
periods of time.
Aim at representing an
earlier stage(s) of a
language.
They help to trace the
development of a language
over time.

Example
Helsinki Corpus – texts from about 700 to 1700;
1.5 million words

Regional Corpora
Aim at representing a regional variety of a
language, such as dialects.

Learner’s Corpora
 Aim at representing the language as produced by the
learners of a language, and they include spoken or
written language samples produced by non-native
speakers.
 They are used to identify differences among learners in
word frequency and types of mistakes.
 They show in what respects learners differ from each
other and from native speakers.

Example
 Louvain Corpus of Native
English Essays (LOCNESS)
 International Corpus of Learner
English (ICLE)
 20,000 words

Multilingual Corpora
 Any systematic collection of empirical language data
enabling linguists to carry out analyses of multilingual
individuals, multilingual societies or multilingual
communication.

Comparable Corpora
 Two (or more) corpora in different languages (e.g. English
and Spanish) or in different varieties of a language (e.g.
Indian English and Canadian English).
 They are designed along the same lines – will contain the
same proportions of newspaper texts, novels, casual
conversation, etc.
 Comparable corpora of varieties of the same language can be
used to compare those varieties.
 Comparable corpora of different languages can be used by
translators to identify differences and equivalences in each
language.

Example
 The International Corpus of English (ICE) consists of
comparable corpora of 1 million words each, covering
different varieties of English.

Parallel Corpora
 Two (or more) corpora in different languages, each
containing texts that have been translated from one
language into the other, or texts that have been produced
simultaneously in two or more languages.
 Can be used by translators and by learners to find
potential equivalent expressions in each language and to
investigate differences between languages.

 Size
 Representativeness
 Registers / modes / topics
 Demographics
 Production / reception
 Research goals
 Funding
 Time
 Staff/students

Written Corpora
 Obtaining/creating, Storing, Organizing
Materials Required:
-scanner, OCR software
Process:
-paper document into electronic text file
Types:
-newspapers, periodicals
-small specialized corpora
-informal writings (travel diaries, e-mail,
discussion, blogs, news groups)

Spoken Corpora
 deciding on a transcription system
I. prosodic/non prosodic
II. representing interactional characteristics of
speech (overlapping speech, back channels,
pauses, non-verbal contextual events)
III. permission to use data
IV. ensuring anonymity
V. avoiding impracticality of data

Markup
1. Structural markups:
-written corpus: Titles, authors, paragraphs, subheadings,
chapters etc.
-spoken corpus: Contextual events, paralinguistic features
2. Header:
-written corpus:
Classification into categories (register, genre, topic domain, discourse
mode, formality)
-spoken corpus:
Demographic information about the speaker (gender, social
class, occupation, age, native language/dialect)
Relationship among the participants

Linguistic Annotation
Parts of Speech Tagging:
Grammatical category, case assigning
Prosodic Annotation
Phonetic Annotation
Syntactic Parsing

Advantages of Tagging
Vast exploration
Frequency
Co-occurrence
Multiple meaning studies
Automatically retrievable

Concordance Lines
 Concordance lines are a useful tool for
investigating corpora, but their use is limited by
the ability of the human observer to process
information.
 Statistical calculations of collocation and corpus
annotation help to overcome this limitation.
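
To make the idea of a concordance line concrete, here is a minimal sketch in Python of a keyword-in-context (KWIC) display. The function name `kwic`, the `width` parameter and the sample sentence are illustrative assumptions; real concordancers offer far more (sorting, file handling, tag-aware search).

```python
import re

def kwic(text, node, width=30):
    """Return one keyword-in-context line per occurrence of `node`."""
    lines = []
    # \b stops "light" from matching inside a longer word such as "lightning"
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical sample text, not taken from any real corpus
sample = ("The light was fading. She tried to light the lamp, "
          "but the light bag slipped from her hand.")

for line in kwic(sample, "light", width=20):
    print(line)
```

Even in this toy sample, reading down the aligned column shows the node word used as a noun, a verb and an adjective, which is the kind of pattern a human reader looks for in concordance lines.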

Frequency and Key-word Lists
 A frequency list is a list of all the types in a corpus
together with the number of occurrences of each type.
 Comparing the frequency lists for two corpora can give
interesting information about the differences between them.
 e.g. Kennedy (1998):
 a comparison between a corpus of Economics texts
and one of general academic English → the words price,
cost, demand, curve, firm… are frequently found in the
Economics corpus.
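
As a rough illustration of a frequency list and of comparing two lists, the following Python sketch uses only the standard library. The two one-sentence "corpora", the simple tokenizer and the per-thousand normalization are assumptions made for the example; they are not Kennedy's data or method.

```python
import re
from collections import Counter

def tokenize(text):
    # Deliberately simple: lowercase, keep alphabetic runs and internal apostrophes
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(text):
    """Map each word type to its number of occurrences (tokens)."""
    return Counter(tokenize(text))

def per_thousand(counter):
    """Normalize raw counts so corpora of different sizes can be compared."""
    total = sum(counter.values())
    return {word: 1000 * count / total for word, count in counter.items()}

# Hypothetical miniature corpora
economics = "The price rises when demand rises. The firm sets the price above cost."
general = "The study shows that the method works when the sample is large."

econ_freq = frequency_list(economics)
gen_freq = frequency_list(general)

econ_rate = per_thousand(econ_freq)
gen_rate = per_thousand(gen_freq)

# Word types that are relatively more frequent in the Economics sample
more_in_econ = sorted(w for w in econ_rate if econ_rate[w] > gen_rate.get(w, 0))
print(econ_freq.most_common(3))
print(more_in_econ)
```

Normalizing before comparing matters because raw counts from corpora of different sizes are not directly comparable.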

Keywords
 A useful starting point in investigating a
specialized corpus.
 They can be lexical items which reflect the topic
of a particular text but also grammatical words
which convey more subtle information.
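
The slides do not say which statistic is used to identify keywords. One commonly used measure is log-likelihood, sketched below in Python under that assumption; the function name and the counts in the usage lines are hypothetical.

```python
import math

def log_likelihood(a, b, c, d):
    """Keyness of one word.

    a, b: frequency of the word in the study corpus and the reference corpus
    c, d: total number of tokens in the study corpus and the reference corpus
    """
    # Expected frequencies if the word were equally common in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Hypothetical counts: same rate in both corpora, then a much higher rate in the study corpus
print(log_likelihood(10, 10, 1000, 1000))
print(log_likelihood(50, 5, 1000, 1000))
```

A word with the same relative frequency in both corpora scores zero; the larger the score, the stronger the evidence that the word is distinctive of one corpus.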

Collocation
 The tendency of words to be biased in the way
they co-occur.
 Statistical measurements of collocation are more
reliable than intuition, and for this reason a
corpus is essential.

Measurements of Collocation
 Computer programs, which calculate collocation,
take a node word and count the instances of all words
occurring within a particular span.
 Note: the count ignores punctuation marks.
 It counts ’s as a separate word.
 It ignores sentence boundaries.
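
The counting procedure described above can be sketched in Python as follows. This is a minimal illustration, not the algorithm of any particular program: the function name `collocates`, the default span of 4 and the sample text are assumptions.

```python
import re
from collections import Counter

def collocates(text, node, span=4):
    """Count every word occurring within `span` tokens of the node word."""
    normalized = text.lower().replace("’", "'")
    # 's becomes its own token; all other punctuation is dropped.
    # Sentence boundaries are ignored because the text is one flat token list.
    tokens = re.findall(r"'s\b|[a-z]+", normalized)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample text
sample = ("Strong tea is nice. He drinks strong tea daily; "
          "the shop's strong tea sells well.")

print(collocates(sample, "tea", span=1).most_common(1))
```

These are raw co-occurrence counts only. Statistical measures of collocation, such as mutual information or the t-score, are computed on top of counts like these together with the overall frequency of each word.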

Tagging and Parsing
 Tagging is allocating a part of speech (POS) label
to each word in a corpus.
 e.g. the word light is tagged as a verb, a noun
or an adjective each time it occurs in the corpus.
 Parsing is analyzing the sentences in a corpus
into their constituent parts, that is, doing a grammatical analysis.
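
To show what a tagged corpus makes retrievable, here is a small Python sketch that stores tokens as (word, tag) pairs. The sample is tagged by hand and the tag labels are illustrative assumptions; a real corpus would be tagged by an automatic tagger with its own tagset.

```python
from collections import Counter

# Hand-tagged toy sample; tag labels are illustrative only
tagged = [
    ("the", "DET"), ("light", "NOUN"), ("was", "VERB"), ("dim", "ADJ"),
    ("please", "INTJ"), ("light", "VERB"), ("the", "DET"), ("fire", "NOUN"),
    ("a", "DET"), ("light", "ADJ"), ("bag", "NOUN"),
]

def tag_counts(tagged_tokens, word):
    """Count how often `word` carries each part-of-speech tag."""
    return Counter(tag for w, tag in tagged_tokens if w == word)

def find(tagged_tokens, word, tag):
    """Return the positions where `word` occurs with the given tag."""
    return [i for i, (w, t) in enumerate(tagged_tokens) if w == word and t == tag]

print(tag_counts(tagged, "light"))
print(find(tagged, "light", "VERB"))
```

Without the tags, a search for light would return all three uses together; with them, each use can be retrieved and counted separately.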

Annotation
 General term for tagging and parsing, and also
used to describe other kinds of categorisation that
may be performed on a corpus.
(e.g.) The annotation of a spoken corpus for
prosodic features.
 The annotation of a corpus of learner English for
types of error.
 Annotation of anaphora and semantic annotation.

Software
 Special software is used to search a corpus for
certain words or phrases and to analyze the results.
For example
• SARA for the BNC
• ICECUP for ICE-GB (the British component of ICE).
• Concordancers can be used for the analysis of almost
any corpus.

Concordancer
 One of the most frequently used concordancers is
‘WordSmith Tools’.
 Its two most important tools are:
 Concord and WordList
 As an alternative to WordSmith, you can also use a
concordancer called ‘AntConc’, which can be
downloaded for free.

WordSmith Concord
 Click on the WordSmith icon on the desktop to
open the program. Select Concord to
search a corpus for a certain word or phrase. You
can now choose a corpus and select the files of
the corpus you want to analyse.

Some further options for entering a search word or
phrase:
 By using the asterisk *, you can widen the scope of your
search. Entering going as a search word returns only
instances of going; entering going to returns only instances
of going to. If you type in go*, on the other hand, you get
all words beginning with go-, e.g. going, goes, gold.
Searching for *ing returns all words ending in -ing,
e.g. swimming, dancing, sing.
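
The effect of the asterisk can be imitated in Python with the standard library's `fnmatch` module. This is only a sketch of the idea, not how WordSmith is implemented; the token list is made up, and it is an assumption that `*` may also match zero characters, as it does in `fnmatch`.

```python
from fnmatch import fnmatchcase

def wildcard_search(tokens, pattern):
    """Return the tokens matching a pattern in which * stands for any characters."""
    return [token for token in tokens if fnmatchcase(token, pattern)]

# Hypothetical token list
tokens = ["going", "goes", "gold", "swimming", "dancing", "sing", "go", "ago"]

print(wildcard_search(tokens, "going"))  # ['going']
print(wildcard_search(tokens, "go*"))    # ['going', 'goes', 'gold', 'go']
print(wildcard_search(tokens, "*ing"))   # ['going', 'swimming', 'dancing', 'sing']
```

The output shows why wildcard searches need manual checking: go* also retrieves gold, and *ing also retrieves sing, neither of which is a form of the pattern being studied.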

WordSmith WordList
 The tool WordList generates word lists of the
selected text files and enables you to compare the
length of text files or corpora.
 Moreover, you can use WordList to compare the
frequency of a word in different text files or
across genres and to identify common clusters.

AntConc Concordance tool
 This tool shows the words or word strings you want to
analyse in their textual context.
 Select the files you want to analyse: File > Open file(s)
 Choose the tab "Concordance"
 Type in a search word (“Search Term”, bottom left-hand
corner)

 More reliable than intuition.
 Language patterns are easily identified.
 Deconstruct texts to discover patterns.
 Track the development of specific features in the
history of English.
 Test hypotheses on specific language features
empirically.
 Follow language acquisition properly.
 Draw conclusions from large amounts of linguistic data.
 Shows frequency rather than possibility.
 Not always a complete picture.

 More communicative modes:
 spoken corpora, interactional corpora (classroom
interactions, authentic interactions, etc) multimodal
corpora, corpora of textbook materials, etc.
 More text types and genres, to cover text types which are
under-represented in corpora (letters, emails, leaflets, TV
programs, book synopses, recipes, short notes, chat room
logs, etc.).

 More longitudinal language data:
 from beginners to advanced levels, from children to adults,
from L1 to L2.
 More variables:
 more language learning variables should be collected and
encoded at the time of corpus collection (proficiency, language
aptitude, motivation, more precise description of the task, of
temporal, social or situational settings, etc).
 More languages:
 to counterbalance the predominance of Anglo-Saxon native and
learner corpora and to foster the computer-aided analysis of
different languages and language families.

 Prior to Corpus Linguistics it was difficult to note patterns of
use in language, since observing and tracking usage patterns
was a monumental task.
 Scholars have used various types of corpora to gain insights
into changes related to language development, both in first
and second language situations.
 Corpus Linguistics can show how language is used and
how its use varies in different situations.
