4. Corpus
•A corpus can be defined
as a systematic collection
of naturally occurring
text in electronic form.
6. Corpus linguistics
• Corpus linguistics is the study of
language/linguistic phenomena
through the analysis of data
obtained from a corpus.
• Corpus linguistics is the analysis of
text with the help of a computer,
i.e. with specialized software.
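As a rough illustration of what "analysis with specialized software" can mean, the sketch below counts word frequencies in a tiny made-up sample. The sample text and the function name `word_frequencies` are assumptions for illustration only; real corpus tools do far more (tagging, lemmatization, metadata).

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter of lower-cased word tokens found in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# A tiny made-up sample standing in for a real corpus.
sample = "The cat sat on the couch. The dog sat on the mat."
freq = word_frequencies(sample)

# Show the three most frequent word forms with their counts.
print(freq.most_common(3))
```

Even this simple count shows the basic idea: claims about language are checked against observed data rather than intuition.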
7. • A corpus is always designed for a
particular purpose, so the usefulness of a
ready-made corpus must be judged
with regard to the purpose to which a
user intends to put it.
8. Famous corpora
•The Brown Corpus
•The Lancaster-Oslo/Bergen Corpus
•The London-Lund Corpus
•The British National Corpus
10. The Brown Corpus
• The Brown Corpus of Standard
American English was the first of
the modern, computer readable,
general corpora. The corpus
consists of one million words of
American English texts printed in
1961.
11. The Lancaster-Oslo/Bergen Corpus
• The Lancaster-Oslo/Bergen Corpus
is a million-word collection of British
English texts which was compiled in
the 1970s in collaboration between
the University of Lancaster, the
University of Oslo, and the
Norwegian Computing Centre for
the Humanities, Bergen.
12. The London-Lund Corpus
• The London-Lund Corpus of
English derives from two projects:
the Survey of English Usage at
University College London and the
Survey of Spoken English, which was
started at Lund University in 1975.
The corpus consists of 500,000 words
of spoken British English.
14. The British National Corpus
• The British National Corpus is a
100-million-word collection of samples of
written and spoken language from a
wide range of sources, designed to
represent a wide cross-section of
British English from the later part of
the 20th century.
15. Creation of BNC:
• The project was developed by an
industrial/academic consortium called
the BNC Consortium.
• The consortium is led by Oxford
University Press, and its members
include major dictionary publishers.
16. • The Consortium was formed in
1990 and started work in 1991 on
the three-year task of producing a
hundred-million-word corpus of
modern British English for use in
commercial and academic research.
All major decisions regarding the BNC
are still made by the Consortium.
18. Why we use BNC
• The BNC can be used to discover aspects of a
word that we did not know about and to check our
intuitions about its meaning. The corpus shows
what a word actually means in use, not just what
we think it means. We can also use the BNC to
check whether a word occurs in the corpus at all.
20. The BNC is a sample of 100 million
words of spoken and written
British English. It is a
balanced and finite corpus that
contains approximately 90%
written data and 10% spoken
data.
Features of British National Corpus
22. The conversational part:
• This part is largely based on recordings of
everyday conversational interaction engaged in by
some 127 adults aged 15 and over. Some additional
recordings of speakers under fifteen were included
from COLT. The volunteers were selected according
to age, social group, and sex, with the aim of
obtaining approximately equal numbers in each
group. The conversational part makes up just over
40% of the spoken corpus.
23. Respondents in the "conversational part"
were selected according to the following
properties:
Age             Social group    Sex             Percentage
Under fifteen   Upper class     Male            41.14
15-24           Middle class    Female          58.47
24-34           Lower class     Unclassified    0.38
24. The task-oriented part:
This material was intended to represent
those types of task-oriented spoken activity
that were unlikely to be recorded by
conversational volunteers during a typical day
in their lives, e.g. lectures, consultations,
sermons, and TV/radio broadcasts. This
part makes up about 60% of the spoken corpus.
26. Continued…..
Imaginative texts account for 20% and
informative texts for about 80% of the written
component. The imaginative texts are further
divided into categories such as prose and poetry,
while the informative texts are subdivided into
eight categories:
1. Arts 2. Natural sciences
3. Commerce 4. Applied sciences
5. Leisure 6. Social sciences
7. Belief and thought 8. World affairs
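The proportions quoted on these slides can be turned into rough word counts. The sketch below is only arithmetic on the rounded figures given here (100 million words, 90/10 written/spoken, 20/80 imaginative/informative, 40/60 conversational/task-oriented); the function name `breakdown` is an assumption, and the real corpus counts differ somewhat from these round numbers.

```python
TOTAL_WORDS = 100_000_000  # approximate size quoted for the BNC

def breakdown(total=TOTAL_WORDS):
    """Estimate component sizes from the rounded percentages on the slides."""
    written = total * 90 // 100   # about 90% written
    spoken = total - written      # about 10% spoken
    return {
        "written": written,
        "written/imaginative": written * 20 // 100,
        "written/informative": written * 80 // 100,
        "spoken": spoken,
        "spoken/conversational": spoken * 40 // 100,
        "spoken/task-oriented": spoken * 60 // 100,
    }

sizes = breakdown()
for name, words in sizes.items():
    print(f"{name:<24}{words:>12,}")
```

On these figures the conversational part comes to roughly four million words, which is small next to the written component but still large for a spoken corpus.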
27. Abbreviations and acronyms:
The BNC contains the same abbreviated
sequence written in different ways, such as
P.C. and PC, and the same form may
reflect different origins (police
constable, postcard, personal computer).
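One practical consequence is that a search for a single spelling misses the other variants. A minimal sketch of one way to group variants under a common key follows; the helper names are hypothetical, and this is not how the BNC or its search tools handle abbreviations. Note that grouping spellings does nothing to tell the different meanings apart.

```python
import re
from collections import defaultdict

def normalise_abbreviation(token):
    """Drop full stops and upper-case, so 'P.C.' and 'PC' share one key."""
    return re.sub(r"\.", "", token).upper()

def group_variants(tokens):
    """Map each normalised key to the set of spellings seen for it."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[normalise_abbreviation(tok)].add(tok)
    return dict(groups)

# Made-up list of spellings, not taken from the corpus.
variants = group_variants(["P.C.", "PC", "P.C", "B.B.C.", "BBC"])
for key in sorted(variants):
    print(key, sorted(variants[key]))
```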
28. Monolingual:
Although the BNC includes many
different styles, varieties and genres,
it deals only with modern British
English and not with other
languages used in Britain.
29. Synchronic:
The BNC covers British English of the late twentieth
century, rather than the historical development
which produced it. As a finite corpus, it is not
extended with new texts over time; only revised
editions have been released.
31. First edition
• The first edition of BNC was
completed in 1994.
• The first general release of the corpus
for European researchers was
announced in February 1995.
32. BNC World
• BNC World, a slightly revised version, was
made available in 2001. The name indicates that
the corpus is now available under license
worldwide.
33. The BNC is available in two flavors:
1. Under the single-user license (cost: 50
pounds), you can install the whole corpus
and the SARA software on a single
machine for personal use.
2. Alternatively, for the same price, you can
install just the corpus itself and use
whatever software you want.
34. BNC XML
• BNC XML is the latest version of the
British National Corpus.
• XML stands for Extensible Markup
Language.
• XML is a set of rules for encoding
documents in machine readable form.
35. • The main differences between this version
and BNC World are:
1. Errors and inconsistencies have been
removed.
2. Lemma information has been added.
3. Simplified part-of-speech information has
been added.
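To make the lemma and part-of-speech annotation concrete, the sketch below parses a small hand-written XML fragment with Python's standard library. The fragment is invented for illustration; the attribute names (`c5` for the detailed tag, `hw` for the lemma, `pos` for the simplified word class) are assumed to follow the BNC XML markup and should be checked against the corpus documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the assumed style of BNC XML, not real corpus data.
FRAGMENT = """
<s n="1">
  <w c5="AT0" hw="the" pos="ART">The </w>
  <w c5="NN2" hw="couch" pos="SUBST">couches </w>
  <w c5="VBB" hw="be" pos="VERB">are </w>
  <w c5="AJ0" hw="new" pos="ADJ">new</w>
  <c c5="PUN">.</c>
</s>
"""

def read_words(xml_text):
    """Return (word form, lemma, simplified POS) for every <w> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        (w.text.strip(), w.get("hw"), w.get("pos"))
        for w in root.iter("w")
    ]

rows = read_words(FRAGMENT)
for form, lemma, pos in rows:
    print(f"{form:<10}{lemma:<10}{pos}")
```

With lemma information, a search for "couch" can also retrieve "couches" without the user listing every inflected form.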
36. • BNC XML can be accessed in three ways:
1. Online use.
2. Download the corpus and XAIRA.
3. Download just the corpus and use it with
any software you want.
37. •Two subsets of BNC have
been produced separately:
• BNC Baby.
• BNC Sampler.
38. BNC Baby
• BNC Baby is a subset of
the BNC. It consists of
four one million word
samples, each compiled
as an example of a
particular genre: fiction,
newspaper, academic
writing and spoken
conversation.
39. BNC Sampler
• The BNC Sampler is a subset of the full BNC. It
comprises two samples, one of written and one of
spoken material, of one million words each, compiled
to mirror the composition of the full BNC as far as
possible.
• The sampler was first created at Lancaster University
during the creation of the BNC.
40. Online use of BNC
• Go to the home page.
• Type the word into the search bar and then click
the search button.
• The results show the contexts in which the word is
used.
• For instance, if we look for the word "couch", the
corpus will show us its collocations, its frequency
and KWIC (key word in context) lines.
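The KWIC display can be sketched in a few lines of Python. This is a minimal illustration on a made-up sentence pair, not the algorithm used by the online BNC interface; the function name `kwic` and the `width` parameter are assumptions.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up sample text, not taken from the corpus.
sample = "She sat on the couch. The old couch was very comfortable."
hits = kwic(sample, "couch")

print("frequency:", len(hits))
for left, kw, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  [{kw}]  {right}")
```

The number of hits gives the frequency of the word, and the words that recur in the left and right contexts are the starting point for finding its collocations.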