2. What is a corpus?
• A collection of words?
• Is it a theory or methodology of language?
3. Why use a corpus?
• Large amounts of data tell us about tendencies
and what’s normal or typical in real-life language
use.
• Corpora also reveal very rare or exceptional
cases that we wouldn’t get from looking at single
texts or from introspection.
• Human researchers make mistakes and are
slow. Computers are much quicker and more
accurate at counting and searching large amounts of text.
4. Criteria in building a corpus
1. It must be a large body of text.
2. It must be representative of a language (or of
a genre of that language).
3. It must be in machine-readable form (e.g. .txt
files on a computer).
4. It acts as a standard reference for what’s
typical in the language.
5. It is often annotated with additional linguistic
information, e.g. grammatical codes.
5. Annotation and mark-up
Corpus texts may be enriched with additional information to ease
analysis.
Note that this type of additional information may be called ‘mark up’,
‘annotation’, or ‘tagging’. All three terms are near synonyms.
Annotation usually refers to linguistic information encoded in a corpus.
However, the encoding is achieved using a mark-up language.
Similarly, the annotation itself is usually undertaken by putting
so-called tags (short codes that indicate some linguistic feature) into a
text. Hence, while the terms can be separated, they can also be used
interchangeably!
One final note: an XML closing tag is marked with a forward slash (e.g. </p>)
rather than a backslash.
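The point about slashes can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is hypothetical and is not part of any corpus toolkit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# A closing tag written with a forward slash parses.
print(is_well_formed("<p>A WARRANT for the arrest</p>"))   # True
# The same fragment "closed" with a backslash is rejected.
print(is_well_formed("<p>A WARRANT for the arrest<\\p>"))  # False
```

Note that this checks XML only. SGML-style mark-up with unquoted attribute values or unclosed elements would not pass this check, because XML requires quoted attribute values and a matching close for every element.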
6. Some untagged text
“Arrest warrant out for Clowes’ partner years
before collapse.”
By Daniel John
A WARRANT for the arrest of the former partner
of Mr Peter Clowes was issued seven years
before his Barlow Clowes investment empire
collapsed, according to evidence submitted to
the Parliamentary Ombudsman.
7. Add tags for headers and paragraphs
<head type=MAIN>
“Arrest warrant out for Clowes’ partner years before collapse.”
</head>
<head type=BYLINE>
By Daniel John
</head>
<p>
A WARRANT for the arrest of the former partner of Mr Peter Clowes was
issued seven years before his Barlow Clowes investment empire
collapsed, according to evidence submitted to the Parliamentary
Ombudsman.
</p>
8. Add sentence tags
<head type=MAIN>
<s n=001>“Arrest warrant out for Clowes’ partner years before
collapse.”
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman.
</p>
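The sentence-numbering step above can be sketched in a few lines of Python. This is an illustration only: it assumes the text has already been split into sentences (sentence splitting is a harder problem in its own right), and the function name add_sentence_tags is hypothetical.

```python
def add_sentence_tags(sentences, start=1):
    """Prefix each pre-split sentence with an <s n=NNN> tag.

    Numbers are zero-padded to three digits and run consecutively
    from `start`, mirroring the <s n=001> style shown on the slide.
    """
    return [f"<s n={i:03d}>{text}" for i, text in enumerate(sentences, start)]


units = [
    "Arrest warrant out for Clowes' partner years before collapse.",
    "By Daniel John",
]
for line in add_sentence_tags(units):
    print(line)
```

Numbering consecutively across headers and paragraphs, as the slide does, gives every sentence in the file a unique reference.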
9. Change quotes to SGML entities
<head type=MAIN>
<s n=001>&bquo;Arrest warrant out for Clowes’ partner years before
collapse&equo;
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman.
</p>
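Replacing typographic quotes with entities is a simple character substitution. The sketch below assumes only the two entity names used on the slide (&bquo; and &equo;); the mapping and the function name quotes_to_entities are illustrative, not a full entity set.

```python
# Map opening and closing double quotation marks to the entity
# names used on the slide. Other characters are left untouched.
QUOTE_ENTITIES = str.maketrans({
    "\u201c": "&bquo;",  # left double quotation mark
    "\u201d": "&equo;",  # right double quotation mark
})


def quotes_to_entities(text: str) -> str:
    """Return text with curly double quotes replaced by SGML entities."""
    return text.translate(QUOTE_ENTITIES)


print(quotes_to_entities("\u201cArrest warrant out\u201d"))
```

Using entities keeps the corpus file in plain ASCII while still recording which quotation mark opened and which closed.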
10. Add tags for punctuation
<head type=MAIN>
<s n=001><c PUQ>&bquo;Arrest warrant out for Clowes<c PUN>’ partner
years before collapse <c PUQ>&equo;
</head>
<head type=BYLINE>
<s n=002>By Daniel John
</head>
<p>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter
Clowes was issued seven years before his Barlow Clowes investment
empire collapsed, according to evidence submitted to the
Parliamentary Ombudsman<c PUN>.
</p>
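Punctuation tagging of this kind can be approximated with a regular expression. This is a rough sketch under stated assumptions: it uses only the two codes shown on the slide (PUQ for quotation entities, PUN for other punctuation), the set of characters treated as PUN is a guess, and the function name tag_punctuation is hypothetical. Real taggers handle many more cases.

```python
import re

# Quotation entities, or a single general punctuation character.
PUNCT = re.compile(r"&bquo;|&equo;|[.,!?\u2019]")


def tag_punctuation(text: str) -> str:
    """Insert a <c PUQ> or <c PUN> tag before each punctuation item."""
    def tag(match):
        item = match.group()
        code = "PUQ" if item.endswith("quo;") else "PUN"
        return f"<c {code}>{item}"
    return PUNCT.sub(tag, text)


print(tag_punctuation("&bquo;Arrest warrant out for Clowes\u2019 partner&equo;"))
```

Matching everything in a single pass means the inserted tags themselves are never rescanned, so no tag is accidentally tagged again.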
12. Types of Corpora
1. Specialised corpus – e.g.
• genre: the language of newspapers
• time: 2005 to the present day
• place: just texts published in China
2. General corpus – needs to be much larger. E.g. the British
National Corpus (BNC) has about 100 million words of
spoken and written British English.
14. Types of Corpora
3. Multilingual corpus – e.g. English and Spanish. Or American
English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM
4. Parallel corpus – e.g. English and Spanish – exactly the
same texts translated. E.g. the CRATER corpus
http://catalog.elra.info/product_info.php?products_id=84
5. Learner corpus – language use created by people learning a
particular language. E.g. the International Corpus of
Learner English.
6. Historical or diachronic corpus – e.g. the Helsinki Corpus – about 1.5 million
words of texts from around AD 700 to AD 1700.
7. Monitor corpus – continually being added to. e.g. the Bank
of English
http://www.collins.co.uk/page/Wordbanks+Online
18. Frequency data, concordances and collocation
• Frequencies
Your query "wash" returned 2415 matches in
952 different texts (in 97,626,093 words; freq:
24.74 instances per million words)
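The per-million figure in the query result is a normalised frequency: the number of hits divided by the corpus size, multiplied by one million. A minimal sketch using the numbers above (the function name per_million is hypothetical):

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Return the frequency of a query normalised per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size * 1_000_000


# Figures from the "wash" query: 2415 matches in 97,626,093 words.
freq = per_million(2415, 97_626_093)
print(f"{freq:.2f} instances per million words")  # 24.74 instances per million words
```

Normalising in this way allows frequencies to be compared across corpora of different sizes.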
23. Corpora and Language Teaching
• Textbooks
• Dictionaries
• Classroom Exercises
• Tests
• Learner Corpora
24. Limitations of Corpus linguistics
• It won’t tell us if something is possible in a language, or
well-formed. E.g. is “he expired of heart disease” acceptable
English?
• Any generalisations we make from corpus data can only be
deductions – not facts.
• Corpora give us evidence, but not information or
explanations. Why do women say “wash” more than men?
• Corpora give us language out of context – so no visual
information, e.g. pictures, fonts etc. And with spoken data –
no information on what the speakers look like, how they behave
or their body language.
25. Further Reading
• McEnery, Tony & Wilson, Andrew (2001) Corpus Linguistics.
Edinburgh: Edinburgh University Press. Chapter 1.
• Hunston, S. (2002) Corpora in Applied Linguistics.
Cambridge: Cambridge University Press. Chapter 1.
26. Question 1
What is a corpus?
• A theory of language.
• A collection of texts stored on a computer.
• An electronic database similar to a dictionary.
• Any large collection of words such as a
collection of books, newspapers or magazines.
27. Question 2
What is the main reason for using corpora?
• Other methods of language analysis are not reliable.
• Computers can confirm our intuitions about language.
• Computers can help us discover interesting patterns in
language which would be difficult to spot otherwise.
• With corpora we can answer all research questions
about language.
28. Question 3
What is corpus annotation?
• Adding an extra layer of information to the
text to allow for more sophisticated searches.
• Separating text into sentences.
• Manual coding of text for parts of speech.
• Adding critical comments to a text.
29. Question 4
What is a specialised corpus?
• A corpus that is used for historical language investigations.
• A corpus that is composed of a large variety of genres.
• A corpus that is used by language specialists.
• A corpus that focuses on e.g. one type of genre, one period,
one place etc.
30. Question 5
Which of these is NOT a type of corpus?
• Multilingual corpus
• Learner corpus
• Diachronic corpus
• Observer corpus
31. Question 6
What is the BNC?
• A large general corpus of British English.
• A corpus of different genres of English writing.
• A large spoken corpus of British English.
• A specialised corpus representing the language of
newspapers.
32. Question 7
Which of these statements is NOT true about a
monitor corpus?
• It is frequently updated.
• The Bank of English is an example of a monitor
corpus.
• The BNC is an example of a monitor corpus.
• It is used to monitor rapid change in language.
33. Question 8
What is a concordance?
• Information about word frequencies normalised
per million words.
• Listing of examples of a word searched in a
corpus with some context on the right and some
context on the left.
• An alphabetical list of words that appear in a text.
• A list of words and their frequencies that can be
used for identifying important words in a text.
34. Question 9
What is collocation?
• The tendency of speakers to talk over each other.
• The tendency of words to co-occur with one
another.
• The tendency of words to appear in unique,
different contexts each time.
• The tendency of sentences to create meaning.
35. Question 10
What is a frequency distribution in a corpus?
• Information about how frequent a word is in a corpus.
• Information about the frequency of use of a term
across a number of different texts, corpus sections,
speakers etc.
• Information about how frequent a word is per million
words.
• Sociolinguistic information about the gender of the
speakers that are represented in a corpus.
36. Brown and LOB
These corpora are sometimes referred to as ‘snapshot’ corpora - their design is such that they try to represent
a broad range of genres of published, professionally authored, English. Their goal is to capture the language at
one moment in time, hence the term ‘snapshot’.
Of course, as with any snapshot there are things you see and things you do not see. So, in this case, we are
looking at professionally authored written English - not speech and not writing of a more informal variety. We
are also only looking at certain genres. As with any snapshot, it was taken at a certain point of time in a certain
place - Brown is America in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to
compare and contrast varieties of a language - in this case two varieties of English. They can also be looked at
on their own to explore either variety of English in its own right.
The Brown corpus is so named because it was developed at Brown University in the US. LOB is an acronym,
standing for Lancaster-Oslo-Bergen, the three universities that collaborated to build that corpus.
Back to the snapshot metaphor! The two corpora can be compared because they are composed in the same
way - the subject is the same, if you like. They look at broadly the same genres. Those genres are represented
by similar numbers of similarly sized chunks of data. Also, of course, the data was gathered in roughly the same
time period.
The genres covered in the two corpora are outlined below. Note the letter code for each genre - that is
important, as it shows you which genre is associated with which file in the corpus. Following the letter code is a
description of the type of data in the category, followed by two numbers in parentheses - the first is the
number of chunks of data in that category in Brown, the second is the number of chunks of data in that
category in LOB. There are five hundred chunks of data in each corpus. Each chunk is approximately 2,000
words in size, giving a rough overall corpus size of 1,000,000 words each.