This part explains how to design, acquire, and process a
collection of linguistic data which will form the raw
material for a dictionary.
Questions (1)
1. What is a reliable dictionary?
2. What is subjective evidence and its limits?
3. What is a citation?
4. What should be the basic steps in setting up a
reading programme?
5. What are the advantages and disadvantages of
citations?
Questions (2)
6. What is a corpus?
7. What are the points that should be considered in
designing a corpus?
8. How large should a corpus be?
9. How do we decide what kinds of written or spoken
material our corpus should include?
10. Can a corpus be representative?
Questions (3)
11. What is ‘skewing’?
12. What are the questions that should be
answered before starting to form the corpus?
13. What is linguistic annotation?
A ‘Reliable’ Dictionary
A reliable dictionary is one whose generalizations
about word behaviour
approximate closely to the ways in which
people normally use language when engaging
in real communicative acts. Yet, it is
difficult to determine how people normally
use words. There is a need for evidence.
Subjective Evidence and Its Limits
Introspection (consulting your own mental lexicon) is a
form of evidence, but it cannot form the basis of a
reliable dictionary alone, since one individual’s store of
words, however large, is not sufficient for such a task.
Informant-testing, in which speakers of a language are
questioned about their use of words, is also of limited
value for mainstream lexicography for similar reasons.
Both of them are essentially subjective forms of evidence.
Creating a reliable dictionary involves a number of
challenging tasks, but the observation of language in use
is certainly the indispensable first stage in the process.
A citation is a short extract from a
text which provides evidence for a
word, phrase, usage, or meaning in
authentic use.
Until the late
citations would be written in
longhand on index cards known as
slips. These were filed alphabetically
according to the keyword of the
citation.
If a blog has a common ancestor
with the diary, one can say that the
two share some DNA.
E.g. ‘MySpace’ shares at least some
of its DNA with the ‘scrapbook’.
Setting up a Reading Programme
Some dictionary publishers provide online
forms to enable members of the public to
contribute citations. Most of these publishers
get unusable citations since their programmes
are not well-planned. A good reading
programme, on the other hand, will often have
Setting up a Reading Programme
There is a need for at least four main data fields:
1- keyword or phrase: the usage that the citation illustrates,
filed under the headword to which it relates.
2- the citation itself: usually a single sentence is adequate, but
there may be more than one.
3- Information about the source of the citation: the date, title,
and author’s name are all important; additional information
(such as the page number) may be useful for specialized or
4- a comment field: this gives readers the option of adding a
note to clarify the citation; it may, for example, be a new
meaning that needs explaining, or it may be characteristic of
one particular dialect.
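The four data fields above can be sketched as a simple record type. This is only an illustrative sketch: the class name `Citation`, its field names, and the `is_usable` check are assumptions made for the example, not part of any publisher's actual submission form, and the source details are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """One citation record with the four data fields listed above."""
    keyword: str                        # 1- usage illustrated, filed under its headword
    text: str                           # 2- the citation itself, usually one sentence
    source_date: str                    # 3- source details: date, title, author
    source_title: str
    source_author: str
    source_page: Optional[int] = None   # optional extra source detail
    comment: str = ""                   # 4- optional clarifying note from the reader

    def is_usable(self) -> bool:
        """Hypothetical check: a record is usable only if no required field is blank."""
        required = (self.keyword, self.text, self.source_date,
                    self.source_title, self.source_author)
        return all(value.strip() for value in required)


# Source details below are placeholders, not real bibliographic data.
slip = Citation(
    keyword="DNA",
    text="'MySpace' shares at least some of its DNA with the 'scrapbook'.",
    source_date="PLACEHOLDER-DATE",
    source_title="PLACEHOLDER-TITLE",
    source_author="PLACEHOLDER-AUTHOR",
    comment="Figurative use: shared origin or character.",
)
print(slip.is_usable())
```

Making the source fields mandatory is one way a well-planned programme could reject unusable contributions at the point of entry.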
Advantages of Citations
1- They are helpful to monitor language change.
2- They give information about the terminology
from a specific subject field or a particular
variety or dialect.
3- They are helpful in training the
Disadvantages of Citations
1- Collecting data in this way is labour-intensive,
so volumes will always be low.
2- Although instances of usage are authentic,
there is a big subjective element in their
selection.
The Central Role of the Corpus
A citation bank alone, even the largest one,
cannot usually supply language data in the
required volumes, so the case for a large
corpus is clear.
A “corpus” is a collection of pieces of
language text in electronic form, selected
according to external criteria to represent,
as far as possible, a language or language
variety as a source of data for linguistic
research (Sinclair 2005).
Inescapable Truths
There is no such thing as a perfect corpus for
First of all, the corpus is a sample. It is not possible to
examine every extant example of usage in a language. To
create a sample that fairly reflects the wider population,
there is a need for carefully selected criteria.
Secondly, selecting texts on the basis of their ‘quality’, and
excluding those which fail this test, is fundamentally at odds
with the descriptive ethos of corpus linguistics. Who is to
judge which texts are ‘good’, and on what basis? It is clear
that a lexicographic corpus must be a genuine and inclusive
snapshot of a language, not a set of texts that have been
specially chosen to advance someone’s notion of what
constitutes ‘good’ usage.
Corpora: Design Issues
Designing a corpus means making decisions about:
1- how large it will be
2- which broad categories of text it will include
3- what proportions of each category it will include
4- which individual texts it will include.
Size: How large is large enough?
It is certain that the more data we have, the more we
learn. Yet, there are also some hypotheses on the size
of the corpus. Zipf’s Law predicts that the tenth most
frequent word in a corpus will occur twice as often as
the 20th most frequent word, ten times as often as the
100th most frequent word, and 100 times as often as
the 1000th most frequent word. Thus, it can be said
that in a corpus of 100 million words, a simple right- or
left-sorted concordance clearly shows most of the normal
patterns of usage for all words except the very rare.
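The ratios quoted above follow from the idealized form of Zipf's Law, in which a word's frequency is inversely proportional to its rank. A minimal sketch, assuming that idealized 1/rank relationship; the function name and the starting count of 1,000,000 are invented for illustration and are not real corpus figures:

```python
def zipf_expected_count(rank: int, known_rank: int, known_count: float) -> float:
    """Expected count at `rank`, assuming frequency is proportional to 1/rank."""
    if rank < 1 or known_rank < 1:
        raise ValueError("ranks start at 1")
    return known_count * known_rank / rank


# Assumed, purely illustrative count for the 10th most frequent word.
tenth_count = 1_000_000

for rank in (10, 20, 100, 1000):
    print(rank, zipf_expected_count(rank, 10, tenth_count))
# 10 1000000.0
# 20 500000.0
# 100 100000.0
# 1000 10000.0
```

Real corpora only approximate this curve, but the sketch shows why rare words need a very large corpus before they yield enough examples.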
Different texts, different styles
However large the corpus may be, if the words are
taken from only a limited area (for instance from
newspapers), they cannot represent all aspects of
the language, and the results may be misleading.
For instance, in a newspaper corpus the word party
will most frequently occur in the sense of a political
organization rather than a social event. A corpus
consisting of a single type of text will reflect only
the stylistic and subject-matter features of that
particular genre. It will be, as corpus linguists say,
a ‘skewed’ corpus. Therefore, the corpus should include
different texts and different styles.
Can a Corpus be Representative?
The standard way of avoiding bias is to collect a ‘random sample’.
Yet random sampling may not represent the language well. One
partial solution is to apply stratified sampling. This involves
breaking up the total population into a number of subcategories or
types, then creating independent random samples from each of
these groupings. But this immediately raises two questions:
1- How do we define these subcategories?
2- How do we decide what proportions of each subcategory the
corpus should include?
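Stratified sampling as described above can be sketched in a few lines. This is a minimal illustration, not a recommended design: the category names, the proportions, and the function name `stratified_sample` are all assumptions chosen for the example, and choosing them is exactly the open problem raised by the two questions.

```python
import random
from collections import defaultdict


def stratified_sample(texts, proportions, total, seed=0):
    """Draw an independent random sample from each subcategory.

    texts:       list of (category, text_id) pairs
    proportions: dict mapping category -> share of the final sample
    total:       number of texts wanted overall
    """
    rng = random.Random(seed)          # fixed seed keeps the sketch repeatable
    by_category = defaultdict(list)
    for category, text_id in texts:
        by_category[category].append(text_id)

    sample = {}
    for category, share in proportions.items():
        wanted = round(total * share)
        pool = by_category[category]
        # Never ask for more texts than the subcategory actually holds.
        sample[category] = rng.sample(pool, min(wanted, len(pool)))
    return sample


# Invented population: 20 text ids in each of three assumed categories.
population = [(cat, f"{cat}-{i}")
              for cat in ("newspaper", "fiction", "conversation")
              for i in range(20)]
shares = {"newspaper": 0.5, "fiction": 0.3, "conversation": 0.2}

picked = stratified_sample(population, shares, total=10)
print({cat: len(ids) for cat, ids in picked.items()})
```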
It is almost impossible to define the population
that the corpus should be representative of,
and since the population is unlimited, it is
logically impossible to establish ‘correct’
proportions of each component. An achievable
objective should be “a balanced corpus”.
Selecting Texts
The process of corpus collection is usually recursive.
First, some texts from a range of sources are gathered.
Next, the texts are analyzed to identify recurring clusters
of linguistic features.
This enables us to establish provisional categories of texts,
grouped on the basis of shared linguistic features.
Then more texts are collected to reflect these feature
Then the analysis is repeated on the enlarged corpus, on
The process thus proceeds in a cyclical fashion until we
collect a large corpus whose contents reflect the proportions
in which the various key features are observable in large
bodies of text.
Spoken Data: A Special Case
With a corpus of spoken language, there are no
obvious objective measures that can be used to
define the target population. The spoken data
should represent variables such as gender,
social class, age and religion. The conversations
that form the corpus should reflect the
diversity of the spoken language.
A Note on ‘Skewing’
Skewing refers to a form of bias in data
whereby a particular feature is either over- or
under-represented to a degree that distorts
the general picture. As corpora grow larger,
problems with skewing usually recede gradually.
There are some questions that should be answered
before starting to form the corpus.
Language: Will the corpus be monolingual, bilingual, or
multilingual? This is an important question before
starting to form the corpus.
Time: Will the corpus be synchronic or diachronic? In
a synchronic corpus, the constituent texts come from
one specific period of time, whereas the texts making
up a diachronic corpus come from an extended period.
Mode: Will the corpus include written texts, spoken
texts, or both? The status of chat room
conversations, which have the characteristics of both
spoken and written texts, is another point that requires
attention in corpus formation.
Medium refers to the channel in which the text
appears. A simple classification here would
distinguish print media and spoken media. The
former include books, newspapers, magazines,
journals, dissertations, movie scripts, government
documents and legal statutes. Spoken media
include face-to-face conversations, broadcasts and
podcasts, public meetings, and educational settings.
Once again traditional categories become blurred
when we add the web to the mix. Some ‘new’ text
types (blogs and social networking sites, for
example) are exclusive to the web, but many
documents exist in both print and electronic media.
Dealing with Sublanguages
When we think about the vocabulary of a
language, it is useful to make a broad
distinction between the core vocabulary and
sublanguages. The word deuce is part of a
sublanguage: it belongs to the vocabulary of
tennis. A word like important, on the other
hand, belongs to the core vocabulary of
English. The following question arises at this
point: will we include the sublanguages?
Collecting Written Data
In the past, the work of lexicographers was
not so easy. Earlier corpora made extensive
use of scanning and keyboarding, which were
both slow and labour-intensive processes.
Today it is possible to find the digital form of
various texts.
Collecting Spoken Data
Traditionally, spoken data has been difficult
and expensive to collect. Consequently,
although the majority of communicative events
in a language occur in spoken mode, few
corpora include high proportions of spoken
material. For instance, only 10 per cent of the
BNC is spoken. Nowadays, web-derived spoken
data, which offers up-to-date material in large
quantities and at low cost, begins to look like an
attractive option.
Collecting Data from the Web
The question of ‘whether the web is a
corpus’ is a hotly debated topic in
language engineering circles. For
lexicography, it is better to see the
web as a source of texts from which
a lexicographic corpus can be built.
There are arguments for using complete
texts rather than extracts. In many
registers, the discourse structure and
features of a text may vary as it
proceeds from its opening paragraphs,
through its central sections, to the
concluding chapters. The BNC’s solution to
this was to ensure that 40,000-word samples
were taken variously from the beginning,
middle, and end of its source documents.
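Taking extracts from different parts of a document can be sketched as below. This is an illustrative helper only, not the BNC's actual procedure; the function name `take_extract` and the tiny ten-word "document" are assumptions made for the example.

```python
def take_extract(words, size, position):
    """Return a contiguous extract of up to `size` words.

    position: "beginning", "middle", or "end" of the source document.
    """
    if size >= len(words):
        return list(words)             # short documents are kept whole
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - size) // 2
    elif position == "end":
        start = len(words) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return words[start:start + size]


# Invented ten-word document; a real sample would be far larger.
document = [f"w{i}" for i in range(10)]

for where in ("beginning", "middle", "end"):
    print(where, take_extract(document, 4, where))
# beginning ['w0', 'w1', 'w2', 'w3']
# middle ['w3', 'w4', 'w5', 'w6']
# end ['w6', 'w7', 'w8', 'w9']
```

Varying the position from one source document to the next keeps openings, central sections, and conclusions all present in the corpus.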
Copyright and Permissions
Unless a corpus is made up of much older texts, most
of its source material is likely to be protected by
copyright. So, corpus-builders should get permission
from the copyright owners to include the documents in
their corpus. This is not an easy task. It is one of the
most time-consuming aspects of the project. It is
recommended that corpus-builders should never
offer to pay for permission to include a text. Once
money starts changing hands, a precedent would be
established that could have fatal consequences for
corpus-creation efforts worldwide.
Processing and Annotating
To give the corpus its final form from its raw
state, some operations are carried out.
and text encoding
Taking a heterogeneous collection of input documents
and converting them all to a standard, usable
form. For instance, non-linguistic sounds in
spoken data (like erm, ooh, mhm) and unusable
material in written data (like indexes, tables,
diagrams) are not included in the corpus.
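One small part of this clean-up, dropping non-linguistic sounds from transcribed speech, might look like the sketch below. The filler list contains only the three examples given above, and the function name and sample utterance are invented for illustration; a real pipeline would work from a fuller, project-specific list.

```python
# Illustrative list, limited to the examples mentioned in the text.
NON_LINGUISTIC = {"erm", "ooh", "mhm"}


def clean_transcript(utterance: str) -> str:
    """Drop non-linguistic sounds from one transcribed utterance."""
    kept = [token for token in utterance.split()
            # Compare without surrounding punctuation and ignoring case.
            if token.strip(".,?!").lower() not in NON_LINGUISTIC]
    return " ".join(kept)


print(clean_transcript("Erm, I think, mhm, it was ooh about ten"))
# I think, it was about ten
```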
Providing each input text with a unique
‘header document’ which records its essential
features. Headers typically give bibliographic
information (title, author’s name, date and
place of publication, and the like) and
precisely locate each text in whatever
typology is being used.
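A header of this kind can be sketched as a small record, with a helper that selects texts by their place in the typology. The field names, the `find_texts` helper, and all header values are hypothetical placeholders; real corpora define their own header schemes.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TextHeader:
    """Hypothetical header: bibliographic details plus typology fields."""
    text_id: str
    title: str
    author: str
    publication_date: str
    publication_place: str
    mode: str       # e.g. "written" or "spoken"
    medium: str     # e.g. "newspaper", "conversation"


def find_texts(headers, **criteria):
    """Return the ids of texts whose header matches every given field value."""
    return [h.text_id for h in headers
            if all(asdict(h)[field] == value for field, value in criteria.items())]


# Placeholder headers, not real bibliographic data.
headers = [
    TextHeader("T001", "Placeholder Title A", "Placeholder Author A",
               "PLACEHOLDER-DATE", "PLACEHOLDER-PLACE", "written", "newspaper"),
    TextHeader("T002", "Placeholder Title B", "Placeholder Author B",
               "PLACEHOLDER-DATE", "PLACEHOLDER-PLACE", "spoken", "conversation"),
    TextHeader("T003", "Placeholder Title C", "Placeholder Author C",
               "PLACEHOLDER-DATE", "PLACEHOLDER-PLACE", "written", "book"),
]

print(find_texts(headers, mode="written"))
# ['T001', 'T003']
```

Because every text carries the same fields, users can restrict a query to one part of the corpus, for instance spoken material only.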
Enriching raw text by adding grammatical
information which will enable corpus users
to frame sophisticated queries and extract
maximum benefit from the data. For
instance, She is tagged as a personal
pronoun, and Really is tagged as a general
adverb. A well-tagged corpus allows us to
focus on each pattern in turn and view a
manageable number of examples.
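The kind of query that tagging makes possible can be sketched with a hand-tagged sentence. The sentence, the tag labels (PRON, ADV, VERB, DET, NOUN), and the function name are all invented for illustration; they do not reproduce the tagset of any particular corpus.

```python
def find_tag_sequence(tagged_tokens, tags):
    """Return every run of words whose tags match `tags` in order."""
    n = len(tags)
    if n == 0:
        return []
    matches = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if [tag for _, tag in window] == list(tags):
            matches.append(" ".join(word for word, _ in window))
    return matches


# Hand-tagged example sentence with assumed tag labels.
tagged = [("She", "PRON"), ("really", "ADV"), ("enjoyed", "VERB"),
          ("the", "DET"), ("party", "NOUN")]

print(find_tag_sequence(tagged, ["ADV", "VERB"]))
# ['really enjoyed']
print(find_tag_sequence(tagged, ["DET", "NOUN"]))
# ['the party']
```

Searching by tag pattern rather than by word form is what lets a user look at one grammatical pattern at a time.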
Final Thoughts
In this part, a methodology for building a
corpus for use in lexicography has been
outlined. This is certainly a difficult
task, and there is no perfect corpus since
language is diverse and dynamic. The aim is to
form a balanced, standardized, well-tagged
corpus. For many kinds of research, a corpus
with meticulously detailed headers and finely
grained linguistic annotation is precisely what
is needed.
Summary (translated from Turkish): Lexicographic Evidence
This part has explained how the data that will serve as the source
for a planned dictionary are designed, collected, and processed.
First, the importance of the dictionary compilers' own vocabularies
should be emphasized. It is certain, however, that an individual
vocabulary, no matter how large, is not sufficient for such work.
In the past, the most frequently used data collection method was
the citation method. Although this method has lost some of its
importance today, it is still in use. Indeed, special programmes
have been developed to collect data from volunteers over the
internet. With the development of computer and internet
technologies, the corpus method has become the most widely used
data collection method. Throughout the part, it has been emphasized
that those who prepare a corpus face many questions and that their
task is a difficult one. Language is diverse and dynamic, so a
perfect corpus cannot be expected. The lexicographers' aim should
be to build a corpus that is balanced, best represents the language
in use, takes into account the variables arising from the people
who use the language and the settings in which it is used, and is
organized in a way that is useful for dictionary-making.