lexicographic evidence


Published in: Technology, Education


  1. Lexicographic Evidence: This part explains how to design, acquire, and process a collection of linguistic data which will form the raw material for a dictionary.
  2. Comprehension Questions (1)
     1. What is a reliable dictionary?
     2. What is subjective evidence, and what are its limits?
     3. What is a citation?
     4. What should be the basic steps in setting up a reading programme?
     5. What are the advantages and disadvantages of citations?
  3. Comprehension Questions (2)
     6. What is a corpus?
     7. What are the points that should be considered in designing a corpus?
     8. How large should a corpus be?
     9. How do we decide what kinds of written or spoken material our corpus should include?
     10. Can a corpus be representative?
  4. Comprehension Questions (3)
     11. What is ‘skewing’?
     12. What are the questions that should be answered before starting to form the corpus?
     13. What is linguistic annotation?
  5. A ‘Reliable’ Dictionary: A reliable dictionary is one whose generalizations about word behavior approximate closely to the ways in which people normally use language when engaging in real communicative acts. Yet it is difficult to determine how people normally use words, so there is a need for evidence.
  6. Subjective Evidence and Its Limits: Introspection, that is, consulting your own mental lexicon, is a form of evidence, but it cannot alone form the basis of a reliable dictionary, since one individual’s store of linguistic knowledge is inevitably incomplete and idiosyncratic. Informant-testing, in which speakers of a language are questioned about their use of words, is also of limited value for mainstream lexicography, for similar reasons. Both are essentially subjective forms of evidence. Creating a reliable dictionary involves a number of challenging tasks, but the observation of language in use is the indispensable first stage in the process.
  7. Citations: A citation is a short extract from a text which provides evidence for a word, phrase, usage, or meaning in authentic use. Until the late twentieth century, the OED’s citations were written in longhand on index cards known as slips. These were filed alphabetically according to the keyword of the citation.
  8. DNA: If a blog has a common ancestor with the diary, one can say that the two share DNA. For example, ‘MySpace’ shares at least some of its DNA with the ‘scrapbook’.
  9. Setting up a Reading Programme: Some dictionary publishers provide online forms to enable members of the public to contribute citations. Most of these publishers get unusable citations because their programmes are not well planned. A good reading programme, on the other hand, will often have great value.
  10. Setting up a Reading Programme (continued): There is a need for at least four main data fields:
     1- Keyword or phrase: the usage that the citation illustrates, filed under the headword to which it relates.
     2- The citation itself: usually a single sentence is adequate, but there may be more than one.
     3- Information about the source of the citation: the date, title, and author’s name are all important; additional information (such as the page number) may be useful for specialized or historical dictionaries.
     4- A comment field: this gives readers the option of adding a note to clarify the citation; it may, for example, be a new meaning that needs explaining, or it may be characteristic of one particular dialect.
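
The four data fields above could be held in a simple record. This is a minimal sketch only: the class name `Citation`, its field names, and the helper `file_citations` are illustrative assumptions, not part of any publisher's actual form, and the source details in the sample records are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Citation:
    """One contributed citation, using the four data fields from the slide."""
    keyword: str                 # field 1: the word or phrase being illustrated
    text: str                    # field 2: the citation itself
    source_date: str             # field 3: source information
    source_title: str
    source_author: str
    page: Optional[str] = None   # optional extra source detail
    comment: str = ""            # field 4: reader's clarifying note


def file_citations(citations: List[Citation]) -> List[Citation]:
    """File citations alphabetically by keyword, as paper slips once were."""
    return sorted(citations, key=lambda c: c.keyword.lower())


# Placeholder records: the source details are deliberately left unspecified.
slips = [
    Citation(
        keyword="DNA",
        text="'MySpace' shares at least some of its DNA with the 'scrapbook'.",
        source_date="(unknown)",
        source_title="(unknown)",
        source_author="(unknown)",
        comment="Figurative use: shared origins of two text types.",
    ),
    Citation(
        keyword="blog",
        text="An invented example sentence containing the word blog.",
        source_date="(unknown)",
        source_title="(unknown)",
        source_author="(unknown)",
    ),
]

for slip in file_citations(slips):
    print(slip.keyword, "|", slip.text)
```

Keeping the keyword separate from the citation text is what makes alphabetical filing under a headword possible.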
  11. Advantages of Citations:
     1- They are helpful for monitoring language change.
     2- They give information about the terminology of a specific subject field or a particular variety or dialect.
     3- They are helpful in training lexicographers.
  12. Disadvantages of Citations:
     1- Collecting data in this way is labour-intensive, so volumes will always be low.
     2- Although instances of usage are authentic, there is a big subjective element in their selection.
  13. The Central Role of the Corpus: A citation bank alone, even the largest one, cannot usually supply language data in the required volumes, so the case for a large corpus is clear. A ‘corpus’ is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005).
  14. Some Inescapable Truths: There is no such thing as a perfect corpus for lexicography. First of all, the corpus is a sample. It is not possible to examine every extant example of usage for a language. To create a sample that fairly reflects the wider population, there is a need for carefully selected criteria. Secondly, selecting texts on the basis of their ‘quality’, and excluding those which fail this test, is fundamentally at odds with the descriptive ethos of corpus linguistics. Who is to judge which texts are ‘good’, and on what basis? It is clear that a lexicographic corpus must be a genuine and inclusive snapshot of a language, not a set of texts that have been specially chosen to advance someone’s notion of what constitutes ‘good’ usage.
  15. Corpora: Design Issues. Designing a corpus means making decisions about:
     1- how large it will be.
     2- which broad categories of text it will include.
     3- what proportions of each category it will include.
     4- which individual texts it will include.
  16. Size: How large is large enough? It is certain that the more data we have, the more we learn. Yet there are also some hypotheses about the size of the corpus. Zipf’s Law predicts that the tenth most frequent word in a corpus will occur twice as often as the 20th most frequent word, ten times as often as the 100th most frequent word, and 100 times as often as the 1000th most frequent word. Thus, it can be said that in a corpus of 100 million words, a simple right- or left-sorted concordance clearly shows most of the normal patterns of usage for all words except the very rare.
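
The ratios quoted for Zipf’s Law can be checked with a minimal sketch. It assumes the idealized form of the law, in which a word's frequency is inversely proportional to its rank (exponent exactly 1). The function names and the count of 1,000,000 for the top-ranked word are illustrative assumptions, not figures from the slides.

```python
def zipf_expected_count(rank: int, top_count: float) -> float:
    """Expected count of the word at `rank`, assuming frequency ~ 1 / rank."""
    return top_count / rank


def zipf_ratio(rank_a: int, rank_b: int) -> float:
    """How many times more often the word at rank_a occurs than the word at rank_b."""
    return rank_b / rank_a


# Hypothetical count for the most frequent word in the corpus.
TOP_COUNT = 1_000_000

for other_rank in (20, 100, 1000):
    ratio = zipf_ratio(10, other_rank)
    count = zipf_expected_count(other_rank, TOP_COUNT)
    print(f"rank 10 vs rank {other_rank}: {ratio:g} times as frequent "
          f"(expected count at rank {other_rank}: {count:g})")
```

Because expected counts fall off so quickly with rank, a rare word needs a very large corpus before it yields enough examples to study, which is the practical argument for size.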
  17. Different texts, different styles: However large a corpus may be, if its words are taken from only a limited area (for instance, from newspapers), they cannot represent all aspects of the language, and the results may be misleading. For instance, the word party will most frequently occur in the sense of a political organization rather than a social event. A corpus consisting of a single type of text will reflect only the stylistic and subject-matter features of that particular genre. It will be, as corpus linguists say, a ‘skewed’ corpus. Therefore, the corpus should include different texts and different styles.
  18. Can a Corpus be Representative? The standard way of avoiding bias is to collect a ‘random sample’. Yet random sampling may not represent the language well. One partial solution is to apply stratified sampling. This involves breaking up the total population into a number of subcategories or types, then creating independent random samples from each of these groupings. But this immediately raises two questions:
     1- How do we define these subcategories?
     2- How do we decide what proportions of each subcategory the corpus should include?
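
Stratified sampling as described above can be sketched as follows. The category names, the proportions, and the function `stratified_sample` are illustrative assumptions. Choosing the real categories and proportions is exactly the open question the slide raises, and this sketch does not answer it.

```python
import random
from typing import Dict, List


def stratified_sample(
    texts_by_category: Dict[str, List[str]],
    proportions: Dict[str, float],
    total: int,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Draw an independent random sample from each category.

    Each category contributes roughly `total * proportion` texts,
    capped at the number of texts actually available in that category.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for category, share in proportions.items():
        pool = texts_by_category[category]
        k = min(len(pool), round(total * share))
        sample[category] = rng.sample(pool, k)
    return sample


# Hypothetical population: 50 text identifiers per category.
population = {
    "fiction": [f"fiction_{i}" for i in range(50)],
    "newspapers": [f"news_{i}" for i in range(50)],
    "academic": [f"academic_{i}" for i in range(50)],
}
# Hypothetical target proportions; they must be decided by the corpus designer.
shares = {"fiction": 0.5, "newspapers": 0.3, "academic": 0.2}

chosen = stratified_sample(population, shares, total=20)
for category, texts in chosen.items():
    print(category, len(texts))
```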
  19. It is almost impossible to define the population that the corpus should be representative of, and since the population is unlimited, it is logically impossible to establish ‘correct’ proportions of each component. An achievable objective is ‘a balanced corpus’.
  20. Selecting Texts: Corpus collection is usually recursive. First, some texts from a range of sources are gathered. Next, the texts are analyzed to identify recurring clusters of linguistic features. This enables us to establish provisional categories of texts, grouped on the basis of shared linguistic features. Then more texts are collected to reflect these feature distributions, and the analysis is repeated on the enlarged corpus. The process thus proceeds in a cyclical fashion until we have collected a large corpus whose contents reflect the proportions in which the various key features are observable in large bodies of text.
  21. Spoken Data: A Special Case. With a corpus of spoken language, there are no obvious objective measures that can be used to define the target population. The spoken data should represent variables like gender, social class, age and religion. The conversations that form the corpus should reflect the diversity of the spoken language.
  22. A Note on ‘Skewing’: Skewing refers to a form of bias in data whereby a particular feature is either over- or under-represented to a degree that distorts the general picture. As corpora grow larger, problems with skewing usually recede gradually.
  23. There are some questions that should be answered before starting to form the corpus.
     Language: Will the corpus be monolingual, bilingual, or multilingual?
     Time: Will the corpus be synchronic or diachronic? In a synchronic corpus, the constituent texts come from one specific period of time, whereas the texts making up a diachronic corpus come from an extended period.
     Mode: Will the corpus include written texts, spoken texts, or both? The status of chat room conversations, which have the characteristics of both spoken and written texts, is another point that requires attention in corpus formation.
  24. Medium: Medium refers to the channel in which the text appears. A simple classification here would distinguish print media and spoken media. The former include books, newspapers, magazines, journals, dissertations, movie scripts, government documents and legal statutes. Spoken media include face-to-face conversations, broadcasts and podcasts, public meetings, and educational settings. Once again, traditional categories become blurred when we add the web to the mix. Some ‘new’ text types (blogs and social networking sites, for example) are exclusive to the web, but many documents exist in both print and electronic media.
  25. Dealing with Sublanguages: When we think about the vocabulary of a language, it is useful to make a broad distinction between core usages and sublanguages. The word deuce is part of a sublanguage: it belongs to the vocabulary of tennis. A word like important, on the other hand, belongs to the core vocabulary of English. The following question arises at this point: will we include the sublanguages?
  26. Collecting Written Data: In the past, the work of lexicographers was not so easy. Earlier corpora made extensive use of scanning and keyboarding, which were both slow and labour-intensive processes. Today it is possible to find various texts in digital form.
  27. Collecting Spoken Data: Traditionally, spoken data has been difficult and expensive to collect. Consequently, although the majority of communicative events in a language occur in spoken mode, few corpora include high proportions of spoken material. For instance, only 10 per cent of the BNC is spoken. Nowadays, web-derived spoken data, which offers up-to-date material in large quantities and at low cost, begins to look like an attractive alternative.
  28. Collecting Data from the Web: The question of whether the web is a corpus is a hotly debated topic in language engineering circles. For lexicography, it is better to see the web as a source of texts from which a lexicographic corpus can be assembled.
  29. Sample Size: There are arguments for using complete texts rather than extracts. In many registers, the discourse structure and rhetorical features of a text may vary as it proceeds from its opening paragraphs, through its central sections, to the concluding chapters. The BNC’s solution to this was to ensure that 40,000-word samples were taken variously from the beginning, middle, and end of its source documents.
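
Taking samples from different positions in a document can be sketched as below. This only illustrates the idea and is not the BNC's actual procedure: the function name `take_sample`, the position labels, and the whitespace-based word splitting are all assumptions.

```python
from typing import List


def take_sample(words: List[str], size: int, position: str) -> List[str]:
    """Return up to `size` consecutive words from the given part of a document.

    `position` is one of "beginning", "middle", or "end".
    Documents shorter than `size` are returned whole.
    """
    if len(words) <= size:
        return list(words)
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - size) // 2
    elif position == "end":
        start = len(words) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return words[start:start + size]


# A stand-in document of 100 'words'; a real document would be split from text,
# for example with document_text.split().
document = [f"w{i}" for i in range(100)]

for where in ("beginning", "middle", "end"):
    extract = take_sample(document, size=10, position=where)
    print(where, extract[0], "...", extract[-1])
```

Varying the position across source documents means that openings, central sections, and conclusions all end up represented somewhere in the corpus.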
  30. Copyright and Permissions: Unless a corpus is made up of much older texts, most of its source material is likely to be protected by copyright. So corpus-builders should get permission from the copyright owners to include the documents in their corpus. This is not an easy task; it is one of the most time-consuming aspects of the project. It is recommended that corpus builders never offer to pay for permission to include a text. Once money starts changing hands, a precedent would be established that could have fatal consequences for corpus-creation efforts worldwide.
  31. Processing and Annotating the Data: To bring the corpus from its raw state to its final form, some operations are carried out.
  32. Clean-up, standardization, and text encoding: Essentially, this is the process of taking a heterogeneous collection of input documents and converting them all to a standard, usable form. For instance, non-linguistic sounds in spoken data (like erm, ooh, mhm) and unusable material in written data (like indexes, tables, diagrams) are not included in the corpus.
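
One small part of the clean-up step, dropping the non-linguistic sounds named above from a transcribed utterance, might look like this. It is a minimal sketch: the filler list contains only the three examples from the slide, and the function name `remove_fillers` is an assumption. A real pipeline would follow the project's own transcription conventions.

```python
import string

# Only the three examples given in the slide; a real list would be longer.
FILLERS = {"erm", "ooh", "mhm"}


def remove_fillers(utterance: str) -> str:
    """Drop filler tokens and normalize whitespace in one transcribed utterance."""
    kept = []
    for token in utterance.split():
        # Compare the token without surrounding punctuation, case-insensitively.
        bare = token.strip(string.punctuation).lower()
        if bare in FILLERS:
            continue  # punctuation attached to a filler is dropped with it
        kept.append(token)
    return " ".join(kept)


print(remove_fillers("Erm, I think   it was, mhm, on Tuesday"))
```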
  33. Documentation: This means providing each input text with a unique ‘header document’ which records its essential features. Headers typically give bibliographic information (title, author’s name, date and place of publication, and the like) and precisely locate each text in whatever typology is being used.
  34. Linguistic Annotation: This means enriching raw text by adding grammatical information which will enable corpus users to frame sophisticated queries and extract maximum benefit from the data. For instance, She is tagged as a personal pronoun, and Really is tagged as a general adverb. A well-tagged corpus allows us to focus on each pattern in turn and view a manageable number of examples.
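
How tags support queries can be shown with a tiny hand-tagged sample. The tag labels (`PRON_PERSONAL`, `ADV_GENERAL`, and so on) are invented for this sketch and do not belong to any real tagset. The tagging is done by hand, not by a tagger, and `find_by_tag` is a hypothetical helper.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, tag) pairs

# Two invented sentences, tagged by hand with made-up tag labels.
corpus: List[TaggedSentence] = [
    [("She", "PRON_PERSONAL"), ("really", "ADV_GENERAL"), ("tried", "VERB")],
    [("Really", "ADV_GENERAL"), (",", "PUNCT"),
     ("she", "PRON_PERSONAL"), ("did", "VERB")],
]


def find_by_tag(tagged_corpus: List[TaggedSentence], tag: str) -> List[Tuple[int, str]]:
    """Return (sentence index, word) for every word carrying the given tag."""
    hits = []
    for sentence_index, sentence in enumerate(tagged_corpus):
        for word, word_tag in sentence:
            if word_tag == tag:
                hits.append((sentence_index, word))
    return hits


print(find_by_tag(corpus, "ADV_GENERAL"))
print(find_by_tag(corpus, "PRON_PERSONAL"))
```

Querying by tag rather than by spelling is what lets a user pull out one grammatical pattern at a time.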
  35. Final Thoughts: In this part, a methodology for building a corpus for use in lexicography has been outlined. This is certainly a difficult task, and there is no perfect corpus, since language is diverse and dynamic. The aim is to form a balanced, standardized, well-tagged corpus. For many kinds of research, a corpus with meticulously detailed headers and fine-grained linguistic annotation is precisely what is needed.
  36. Summary (translated from Turkish): Lexicographic Evidence. This part explains how the data that will serve as the source for a planned dictionary are designed, collected, and processed. First, the importance of the dictionary compilers’ own vocabulary should be emphasized. However, it is certain that an individual’s vocabulary, no matter how large, is not sufficient for such a work. In the past, the most frequently used data collection method was the citation method. Although it has lost some of its former importance, this method is still in use today. Indeed, special programmes have been developed to collect data from volunteers over the internet. With the development of computer and internet technologies, the corpus method has become the most widely used data collection method. Throughout the part, it is emphasized that those who build a corpus face many questions, and how difficult their work is. Language is diverse and dynamic, so a perfect corpus cannot be expected. The aim of lexicographers should be to build a corpus that is balanced, that best represents the language in use, that takes into account the variables arising from the people who use the language and the settings in which it is used, and that is organized in a way that is useful for preparing a dictionary.