Teubert, W. (2004). 'Language and corpus linguistics'. Lexicology and Corpus Linguistics. London: Continuum. Tognini, E., Bon...
The Use of Corpus Linguistics in Lexicography

Transcript of "The Use of Corpus Linguistics in Lexicography"

  1. 1. The Use of Corpus Linguistics in Lexicography: An Integrative Review. Lexicography ENGL 6203. Submitted by: Ihsan Ibadurrahman (G1025429), Syareen Izzaty Bt Majelan (G1029580), Rudiana Razali (G1115202)
  2. 2. The Use of Corpus Linguistics in Lexicography: An integrative literature review
     I. Introduction
     The practice of dictionary-making began as early as 1600, when Robert Cawdrey compiled a dictionary of words that were deemed difficult because they were borrowed from other languages (Siemens, 1994). The words were taken from Latin-English dictionaries and from the available texts of the time, and were given concise definitions, synonyms and a fixed form (Siemens, 1994). It was Samuel Johnson who, in the 1700s, explicitly set out the methods or steps he took to create his dictionary, and some of these methods were then followed in the 1800s by the committee entrusted to create "A New Dictionary", now known as the Oxford English Dictionary.
     A corpus is a collection of samples of authentic spoken and written text used for the analysis of words, meanings, grammar and usage (David, 1992). In Saussurean terminology, the text is akin to parole, while the corpus provides the evidence of langue (Tognini & Bonelli, 2001). The term corpus linguistics is used when a corpus is used specifically to study a language. Lindquist (2009: 1) distinguishes it from other branches of linguistics such as sociolinguistics (the study of language and society) or psycholinguistics (the study of language and the mind) in that corpus linguistics is a specific method used in language study, the "how to" rather than the "what". In other words, corpus linguistics is an approach rather than a specific field of language study (Gries, 2009).
     This paper aims to highlight major findings in the literature on corpus linguistics, with an added emphasis on its use in dictionary-making. In developing this integrative literature review, 18 sources were obtained: 13 books, 2 journal articles, and 3 online articles. After all the literature was reviewed, recurring ideas found in it were compared, listed, and discussed. For ease of reading, the literature has been categorized under separate subheadings, namely the pre-corpus era, the initial corpus, and the present corpus.
  3. 3. II. Literature Review
     a. Pre-corpus linguistics
     Robert Cawdrey's Table Alphabeticall (1604) is considered to be the first monolingual English dictionary ever made, even though glosses of words had been made prior to Cawdrey's dictionary (Jackson, 2002). Cawdrey's dictionary consisted of 2543 hard words, loanwords that were considered difficult for the uneducated reader to learn, gathered from Latin-English dictionaries and from glosses of religious, legal and scientific texts (Siemens, 1994). Cawdrey provided a concise definition of each word, a synonym or explanatory phrase, and a fixed form for many of the difficult words (Siemens, 1994; Jackson, 2002). After Cawdrey's dictionary appeared, much effort was made to improve the quality of dictionaries, and subsequent dictionaries followed Cawdrey's method of extracting hard words from different texts and including them in the dictionary.
     In 1755 Samuel Johnson published a two-volume dictionary that he had worked on for 9 years (Jackson, 2002). It remained the standard English dictionary for 150 years, until the Oxford English Dictionary appeared in England, and it was the first dictionary to use quotations to indicate how each word was used (Baugh & Cable, 2002). In the letter to his patron, Johnson wrote that he had faced difficulties in adding a word to the dictionary, in the following order:
     1) Selecting words. Johnson had to decide which words to include in the dictionary and to classify each word as foreign or English, since a great deal of borrowing had been made from other languages. He also had to decide whether words from specific professions should be included.
     2) Orthography. Johnson proposed that no change should be made to the spelling of words without sufficient reason, because change would only cause inconvenience to others and is a mark of weakness or inconsistency.
     3) Pronunciation. Johnson says that, along with orthography, pronunciation should also be constant, because stability is important to the lifespan of a language and any change would create almost a new speech, which would corrupt the spoken English of that time.
  4. 4. 4) Etymology and derivation. It is important to know the etymology of a word because, given the amount of borrowing from different languages, it is hard to discern which words are native to English.
     5) Analogy. The rules that govern how the words are used are included.
     6) Syntax. The construction of each word is shown, because the construction of English is too inconsistent to be reduced to rules alone.
     7) Phraseology. The phrases in which the word is used are included to illustrate the different ways the word can be used.
     8) Interpretation. Compared to the previous steps, Johnson considers the interpretation of a word to be the most difficult part of creating the dictionary, because he had to look at the different usages of each word and come up with the best explanation of it.
     9) Distribution. After all the above steps had been taken, Johnson slotted each word into its proper class.
     After more than 150 years as the main source of reference, with several revisions, Johnson's dictionary was found to be inadequate for the standards of modern scholarship (Jackson, 2002). In 1857 a committee was appointed to collect words that were not in the dictionary, to be added as a supplement, but the committee found that this was not enough, and in 1858 it was decided that a new dictionary should be created (Baugh & Cable, 2002; Jackson, 2002). The main aims of the new project were to record every word found in English from about the year 1000 and to exhibit the history of each through a selection of quotations from the whole range of English writings (Baugh & Cable, 2002). The editors gathered a total of six million slips containing quotations from volunteers, not only in England but all over the world. After 24 years of hard work, they published the first instalment of the dictionary, covering part of the letter A, in 1884. After another 16 years, four and a half volumes had been published, reaching the letter H. Finally, in 1928, the final section of the dictionary was issued, completing "A New Dictionary", now known as the Oxford English Dictionary (OED), after 70 years of effort (Baugh & Cable, 2002). The committee came up with rules that had to be observed by the editors of the OED before a word could be included in the dictionary, in the following order (Considine, 1996):
  5. 5. 1) The Word to be explained.
     2) The Pronunciation and Accent.
     3) The Various Forms assumed by the word, and its principal grammatical inflexions.
     4) The Etymon of the word.
     5) The Cognate Forms in kindred languages.
     6) The Meanings which are logically deduced from the Etymology, and arranged to show the common thread or threads which unite them together.
     Even though over a century had passed since Johnson created his dictionary, some of the steps he took were still used in creating the OED. This shows that Johnson's methods were still relevant to lexicographers and were the main steps taken in making a dictionary before corpus linguistics was introduced into dictionary-making.
     b. The initial stage of corpus linguistics
     In the 1950s, there was growing dissatisfaction that language theory (e.g. Noam Chomsky's Syntactic Structures) could not account for the many 'ungrammatical' patterns found in English (e.g. the distinction between transitive and intransitive verbs). There was a strong call for empirical, real language data (Teubert, 2004). It was then that the corpus was invented. The first corpora came out of the Survey of English Usage at the University of London and the work at Brown University in Providence. In the 1960s, Brown University compiled its million-word corpus of written text from 500 reading passages, which was named the Brown Corpus. This American corpus was a landmark in corpus linguistics, since it was the first corpus to employ a computer in its making. In 1982, the British version of the corpus, named the LOB corpus, was compiled by Hofland and Johansson. LOB stands for Lancaster-Oslo-Bergen and, as its name suggests, it was a collaboration between three institutions: the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities in Bergen. However, both the Brown corpus and the LOB corpus were deemed inadequate as samples of English vocabulary. This gave birth to John Sinclair's English Lexical Studies, which specifically aimed to investigate vocabulary using an electronic text of spoken and written
  6. 6. language. The study gave prominence to collocation: words that naturally co-occur together. Aiming to represent varieties of English where it is used as a first or second language, Sidney Greenbaum began compiling one-million-word corpora called The International Corpus of English in 1988. The unique feature of this corpus is that it samples more spoken language (60%) than written language (40%).
     In the early 1990s, major universities and companies together compiled the British National Corpus (BNC), containing 100 million words from 1980 up to 1993. The compilers were Oxford University Press, Longman, Chambers, the British Library, Oxford University and Lancaster University. The aim was to provide a balanced corpus that represents British English. The corpus includes 10% spoken language and 90% written language, the latter comprising 25% fiction and 75% non-fiction. One big distinction between the BNC and Brown is that the former took samples from longer pieces of text, between 40,000 and 50,000 words. This gives the BNC an added advantage in representativeness, since a text uses words differently at the beginning, in the middle, and at the end (Lindquist, 2009). Due to its sheer size, representativeness, and care, most British publishers prefer to use this corpus as their source of lexicographic information.
     Typically, any corpus needs to go through a three-step process in its making. Before going through these three steps, however, the compiler needs to determine the basic outlines of the corpus, such as its size, its genre, and whether it will look specifically into written language, spoken language, or both. Sinclair (1996) points out that a corpus should be as large as possible and should include samples from a broad range of material, in order to achieve some measure of representativeness within the technology of the time. The corpus should also be classified into different genres and kept even in size. Once these basic outlines are determined, the three-step process may begin. It starts with collecting the data, spoken and/or written. This entails gathering a large mass of speech and written texts, obtaining permission, and keeping careful and organized records. The next step is computerization, which entails converting raw spoken or written text into a digital format on a computer. Recordings of speech may be painstaking to process, since they need to be transcribed manually. Another concern with spoken text is the naturalness of the speech; it needs to be recorded in a natural, casual way that resembles how people speak every day in real life, not in a stilted way. Though written records
  7. 7. seem to be less painstaking, they also have their problems, mainly the copyright issue. In addition, some texts that come from books, magazines, and other written sources need to be retyped, since scanning tools such as OCR (optical character recognition) software, which detect and scan words automatically, usually produce errors, so many that it is best to avoid using them altogether. The last step is annotating, which involves assigning information such as part of speech and etymology to each item of data. It should be noted that the three aforementioned steps need not be seen as separate processes; they are all closely connected. For example, after gathering a recording of speech, it may be best to transcribe it there and then.
     The corpus may have contributed a great deal to language study, but its impact on lexicography did not start until 1989. Together with advances in computer software, it has since contributed significantly to the development of lexicography. Since everything is automated and recorded in a digital format, lexicographers can now save time and the tremendous amount of work needed to compile a dictionary. Typically, a dictionary has information on the part of speech, usage, meaning, pronunciation, and etymology of a word. Before the advent of corpora, all this information had to be gathered manually; lexicographers needed to do the hard labor of collecting slips of paper containing text that they intended to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford English Dictionary, which was originally known as the New English Dictionary (Meyer, 2002). With corpora, dictionary makers can now use a large sample of authentic spoken and written text as a source to illustrate how each word in their list is used in real life. The citations used in the dictionary come from real-life discourse. Real contexts also provide accurate, well-defined lexical meanings for the definition of a word, which is a huge improvement over the previous dictionary practice, in which words were defined in an unscientific manner. One huge improvement in dictionary-making is the rich information available for words that have many variant meanings, such as take, go, and time, which tended to be overlooked in the previous dictionary practice (Lindquist, 2009).
     Another huge advantage of using corpora in lexicography is that information on word frequency can also be obtained. This way, lexicographers can mark whether a word is among the first 500 most common words, the next 500, and so on. Meyer (2002) notes that the most frequent words are function words such as the, an, a, and, and of, which carry little lexical meaning, and that the least frequent words are content words such as proper nouns. Gries (2009)
  8. 8. mentions two kinds of frequency information that lexicographers can obtain from a corpus: frequencies of occurrence of linguistic elements, given in the so-called frequency list, and frequencies of co-occurrence of these linguistic elements, shown in concordances. Lindquist (2009: 5) defines a concordance as "a list of all the contexts in which a word occurs in a particular text". Using a Key Word in Context (KWIC) concordance, words can be retrieved together with their surrounding text and presented vertically on the screen. Since the information is presented in context, lexicographers can easily identify the collocations of each word in their dictionary. Below is an excerpt from concordance software in which the word "corpus" is highlighted.
     Figure 1: Concordance from a software called AntConc 3.2.2w (Gries, 2009).
     The above figure illustrates the concordance software AntConc in use. It should be noted that the software does not come with a ready-made corpus. Hence, users need to have a text file ready in order to generate KWIC output. The latest version of the software is 3.2.4w, and it can be downloaded online at http://www.antlab.sci.waseda.ac.jp/software.html. Similar software that lexicographers may use to find how words are used in context is Wordsmith Tools, devised by Mike Scott in 1993. Since then the software has gone through many changes and now includes a concordancer, word-listing, a web text downloader and many other features (Wikipedia, 2011). Previous versions of the software were sold and owned by Oxford University Press. The software's current version is now owned by Lexical Analysis Software Ltd. The current
  9. 9. Wordsmith version is 5.0, and it can be downloaded online at: http://www.lexically.net/wordsmith/version5/index.html. However, unlike AntConc, Wordsmith is shareware. In order to unlock the demo version from the website, a user needs to pay for a single-user license of £50, or around $70-80, from one of two online retailers (Lexical Analysis Software and Oxford University Press).
     Since a corpus is discourse-based, a word appears in a haphazard, arbitrary collection of occurrences, as illustrated in the figure above. Dictionary makers need to check for contradictions with the 'real' meaning. It is thus dangerous to depend solely on a corpus (Teubert, 2004). One way to check a word in context is to expand the text by retrieving its original source. This feature is lacking in both software packages mentioned previously, AntConc and Wordsmith Tools. Fortunately, the feature is available for free from the Brigham Young University website, which provides a concordancer covering the BNC, COCA (Corpus of Contemporary American English), and some other corpora, and can be accessed at: http://corpus.byu.edu/
     The huge amount of data in a corpus also allows lexicographers to look for new words that occur for the first time in spoken or written text. However, the corpus has to be large enough to yield information on vocabulary items (Meyer, 2002). A small corpus such as the LOB corpus, which stores roughly one million words, cannot give lexicographers enough information on the range of vocabulary items. A monitor corpus is also needed, in which large amounts of language data are pooled from time to time, rather than fixed at one particular time period. This way, the corpus is frequently updated with new words and meanings from today's growing language.
     The first dictionary to be founded wholly on a corpus is the Collins COBUILD series of English Language Dictionary, compiled in 1987 under the guidance of John Sinclair. The dictionary takes its citations from real-life discourse, and each word is defined from these authentic texts instead of relying on previous dictionaries. This entails using a very large corpus so that it can include all lemmas together with their word senses. However, this presents a problem in that rare words such as apothegm tend to be excluded (Teubert, 2004). Besides being the first corpus-based dictionary, COBUILD is innovative in that its definitions are akin to a
  10. 10. classroom teacher explaining the words. For example, in describing the word junk, it says: "You can use junk to refer to old and second-hand goods that people buy and collect" (Jackson, 2002).
     In the practice of dictionary-making, one crucial distinction has to be made between a corpus-based dictionary and a corpus-driven dictionary. Dictionaries such as the Collins COBUILD series of English Language dictionaries are said to be corpus-based if the corpus is used to validate information presented in the dictionary. However, if the corpus is used to extract the information used in the dictionary, they are called corpus-driven. Teubert (2004: 112) suggests that a dictionary should follow the corpus-driven approach so that it may complement standard linguistics and not just extend it.
     c. Modern corpus linguistics
     During the 1970s, computational research on English had not developed much in Birmingham, because heavy preparation went into devising software packages, instituting undergraduate courses and influencing opinion on the campus (Sinclair, 1991). At that time, when computing was restricted almost to dealing with a number of crises, the importance of data processing was highlighted. It has taken approximately fifty years to make real improvement in the area of corpus-based linguistics, which has been driven by systems that work and methodologies that can produce reasonable coverage of linguistic conditions (Lawler & Dry, 1998). Over the years, computational resources such as fast machines and sufficient storage have become accessible for processing large volumes of data. Besides that, in the modern period there is a growing availability of corpora with linguistic annotations, for example part of speech, prosodic intonation and proper names, and of bilingual parallel corpora. Furthermore, the maturity of computational linguistics technology has improved the commercial market for natural language products, and corpus linguistics nowadays is equipped with efficient parsing and statistical techniques.
     From 1980 to 1986, computational work on language was put to good effect and was transformed into a completely new set of techniques for language observation, analysis, and recording. This also led to the development of editing substantial dictionaries using these techniques and a huge database of annotated examples.
  11. 11. One of the most prominent uses of a corpus in recent years is as a resource for lexicography. Corpus-based work for a small number of languages had been used in lexicography, but only recently has the need for very large corpora come to the fore. The collaboration between lexicography and Natural Language Processing (NLP) has encouraged the use of corpora in dictionary projects that have had access to very large corpora (Hua, 2001).
     The computer has a clerical role in lexicography, reducing the labor of sorting and filing and making it possible to examine very large amounts of English in a short time (Sinclair, 1991). In the late 1970s, the prospects of computerized typesetting were growing more realistic. Later, in the early 1980s, a multi-million-word corpus became available for study, though it was still limited. From simple tools, the field has progressed substantially, together with crucial, profound and basic linguistic generalizations (Lawler & Dry, 1998). These developed tools have revealed many topics for inquiry which had not been well explored by traditional linguistic methods.
     In the modern era, the word corpus has been reserved for collections of texts that are stored and accessed electronically. Electronic corpora are usually larger than the small, paper-based collections previously used to study aspects of language (Hunston, 2002). This is due to the capacity of computers, which can store and process large amounts of information compared to earlier times.
     One contribution in the area of corpus linguistics is the work of Johansson and colleagues, who produced a parallel corpus of British English and made it possible for researchers to scrutinize texts of greater length than before. The main structural features of these corpora are:
     - A classification into genres (15) of printed texts
     - A large number (500) of fairly short extracts (2000 words), giving a total of around one million words
     - A close to random selection of extracts within genres
     Due to this, a great amount of useful information can be extracted easily from the corpora. Besides that, many locations have samples of text which provide hundreds of billions of words. Many collections are available, such as the Association for Computational Linguistics' Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, The British
  12. 12. National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical Research (CLR), and Electronic Dictionary Research (EDR), as well as standardization efforts such as the Text Encoding Initiative (TEI) (Armstrong, 1994).
     Apart from lexicography, the application of corpora in applied linguistics also extends to language teaching, and it has benefited a wide variety of fields. Other relevant applications of corpora are in the production of dictionaries and grammars, critical linguistics, translation, literary studies and stylistics, forensic linguistics, and the design of writer support packages (Hunston, 2002).
     In relation to dictionary-making, the contribution of corpora is at its most far-reaching and influential. The use of corpora has changed dictionaries by placing stress on frequency, collocation and phraseology, variation, lexis in grammar, and authenticity (Hunston, 2002). Recent innovations in dictionaries include the on-line Longman Web Dictionary and the Collins COBUILD English Collocations on CD ROM.
     Sinclair (1996) points out that a corpus should be as large as possible and should include samples from a broad range of material, in order to achieve some measure of representativeness within the technology of the time. The corpus should also be classified into different genres and kept even in size.
     d. The use of corpora in language teaching
     The use of corpora as a method in many disciplines is not uncommon (McEnery & Wilson, 1996: 4). Apart from lexicography, other possible areas include language teaching, discourse and pragmatics, semantics, sociolinguistics, historical linguistics and stylistics. Within the area of language teaching there is also a branch known as CALL (Computer-Assisted Language Learning), which provides a further application of corpora. A study was conducted at Lancaster University on the role of corpus-based computer software in teaching undergraduates the basic concepts of grammatical analysis (Hua, 2001). The software is called Cytor, and it reads an annotated corpus, either part-of-speech tagged or parsed, one
  13. 13. sentence at a time. Besides reading the corpus, it also hides the annotation and asks the students to annotate the sentences on their own. In addition, students can call up help in the form of a list of tag mnemonics, examples from a frequency lexicon, or concordances.
     How effective is Cytor at teaching part-of-speech analysis? McEnery, Baker and Wilson (1995, cited in Hua, 2001) compared two groups of students given different treatments, one taught with Cytor and the other via traditional lecturer-based methods. The results suggest that the computer-taught students performed better than the human-taught students throughout the term.
     Another use of corpora in language teaching and learning is the adoption of classroom concordancing (data-driven learning) by classroom practitioners, for whom the corpus has become a source of empirical teaching data (Hua, 2001: 5). One example of a link to Data-Driven Learning is Tim Johns's Home Page at http://web.bham.ac.uk/johnstf/. It provides an outstanding online bibliographic database of books and articles related to corpora and language teaching. Moreover, it includes online worksheets that involve corpora for classroom teaching. Another interesting resource is the "Grammar Safari" site developed at Champaign-Urbana, which can be found online at http://deil.lang.uiuc.edu/web.pages/grammarsafari.html and provides a careful and thoughtful selection of corpus-based activities. Furthermore, the Longman Grammar of Spoken and Written English by Douglas Biber et al., which can be used to answer student questions related to grammar, draws on a useful corpus categorized into fiction, conversation, news, etc.
  14. 14. III. Discussions and Conclusions:
     From the reviewed literature, it can be seen that dictionaries have been around for centuries. The first dictionary was made in the 1600s and was based on what were considered difficult words at that time. During this initial stage, lexicographers faced a number of challenges in adding words to their dictionaries: selecting words, orthography, pronunciation, etymology and derivation, analogy, syntax, phraseology, interpretation, and distribution. All this information had to be gathered manually; lexicographers needed to do the hard labor of collecting slips of paper containing text that they intended to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford English Dictionary, which was originally known as the New English Dictionary. However, with the advent of corpus linguistics, things began to change dramatically.
     In 1989, together with technological advances in computing, the corpus began to make a significant contribution to the development of dictionary-making. Corpus linguistics has made a huge impact on dictionary-making:
     a. It significantly reduces the time and the heavy work needed to compile a dictionary, since everything is automated and computerized.
     b. Each dictionary now reflects how language is used in the real world. Meaning is assigned from these samples, rather than from the writer's point of view.
     c. The frequency of each word in the list can be identified.
     d. Much more information can be given for words with many variant meanings, such as go and take.
     e. It makes it easy to include collocations, because words appear within their surrounding text.
     f. It can quickly take 'new' everyday words into the system.
     However, because a corpus is discourse-based, a word appears in a haphazard, arbitrary collection of occurrences. Dictionary makers need to check for contradictions with the 'real' meaning. It is thus dangerous to depend solely on a corpus. Another disadvantage of corpus-based dictionaries is that they tend to exclude rare words (those not appearing in real-world language) such as apothegm. The first dictionary to be corpus-based is the Collins COBUILD series of English dictionaries.
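     Point c above, together with the 500-word frequency bands mentioned in the literature review, can be illustrated with a minimal Python sketch. This is not the procedure of any dictionary or tool named in this review; the tokenizer, the function names and the sample sentence are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Sort by descending count, then alphabetically so that ties are deterministic.
    return sorted(Counter(tokens).items(), key=lambda item: (-item[1], item[0]))


def frequency_band(rank, band_size=500):
    """Map a 1-based frequency rank to its band: 1 for ranks 1-500, 2 for 501-1000, and so on."""
    return (rank - 1) // band_size + 1


# Hypothetical one-sentence "corpus"; a real corpus would hold millions of words.
sample = "The dictionary lists the word and the meaning of the word."
ranked = frequency_list(tokenize(sample))
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count, frequency_band(rank))
```

     With a corpus of realistic size, the band number would let a lexicographer label a headword as belonging to the first 500 most common words, the next 500, and so on. The simple regular-expression tokenizer is a deliberate shortcut; real projects usually lemmatize first so that inflected forms are counted together.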
A corpus is compiled to serve a linguistic purpose or to preserve texts because of their intrinsic value (Hunston, 2002). It can also be used as groundwork for research. Because a corpus is stored, its users can study it non-linearly, and both quantitatively and qualitatively. A corpus does not in itself contain new information about language; rather, it offers us a new viewpoint on the given information. It shows us a way in which language can be examined. Most of the available software packages process data from a corpus in three ways: showing frequency, phraseology, and collocation (Hunston, 2002).

Corpora have made life simpler as well as more complex. They have made the life of users simpler when, for example, a translator can quickly compare words that are more or less equivalent, or a teacher can refer to the corpus to show why a particular usage is incorrect or inexact. On the other hand, corpora have also made life more complex, in the sense that language is patterned in a much finer way than we might have expected, so that a simple and general rule turns out to apply only in certain contexts (Hunston, 2002).

The term corpus is now reserved for collections of texts that are stored and accessed electronically. Electronic corpora are usually larger than the small, paper-based collections that were previously used to study aspects of language. Electronic corpora gave birth to recent innovations in dictionaries, which include the on-line Longman Web Dictionary and the Collins COBUILD English Collocations on CD-ROM.
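The three kinds of processing discussed above (frequency, phraseology, and collocation) can be illustrated with a minimal sketch. It is not taken from WordSmith or any other package mentioned in this review; the function names, the sample text, and the window sizes are all assumptions chosen for illustration, and a real corpus tool would add proper tokenization, lemmatization, and statistical measures of association.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Frequency: count how often each word form occurs."""
    return Counter(tokens)


def collocates(tokens, node, window=2):
    """Collocation: count words within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def concordance(tokens, node, window=3):
    """Phraseology: list each `node` hit with its left and right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{node}] {right}")
    return lines


# Hypothetical miniature "corpus", invented for this example only.
sample = ("You can take a break now. Take care when you take the bus. "
          "They take a break every hour.")

tokens = tokenize(sample)
freqs = word_frequencies(tokens)
near_take = collocates(tokens, "take", window=2)
lines = concordance(tokens, "take", window=3)

print(freqs.most_common(3))
print(near_take.most_common(3))
for line in lines:
    print(line)
```

Even on this toy sample, the collocate counts show break occurring near take more often than any other word, which is the kind of evidence a lexicographer would look for, on a far larger scale, when deciding whether to record take a break as a collocation.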
References:

Armstrong, S. (1994). Using Large Corpora. Cambridge: MIT Press.
Baugh, A. C. & Cable, T. (2002). A History of the English Language. Oxon: Routledge.
Considine, J. (1996). The Meanings, deduced logically from etymology. In Gellerstam, M.; Jeker Jäborg; Sven-Göran Malmgren; Kerstin Norén; Lena Rogström and Catarina Röjder Pammehl (eds.), Euralex '96 Proceedings. Papers submitted to the Seventh EURALEX International Congress on Lexicography in Göteborg, Sweden. Göteborg: Göteborg University, Department of Swedish, 365-371.
David, C. (1992). An Encyclopedic Dictionary of Language and Languages. Oxford: Oxford University Press. Retrieved from: http://www.tuchemintz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm
Gries, S.T. (2009). 'What is Corpus Linguistics?', Language and Linguistics Compass, Vol. 3, pp. 1-14.
Hua, T.K. (2001). Corpora: Characteristics and Related Studies. Kuala Lumpur: Maziza Sdn Bhd.
Hunston, S. (2002). Corpora in Applied Linguistics. UK: Cambridge University Press.
Jackson, H. (2002). Lexicography, an Introduction. Oxon: Routledge.
Johnson, S. (1747). The Plan of a Dictionary of the English Language.
Lawler, J.M. & Dry, H.A. (1998). Using Computers in Linguistics: A Practical Guide. London: Routledge.
Lindquist, H. (2009). Corpus Linguistics and the Description of English. Edinburgh: Edinburgh University Press.
Mason, O. (2000). Programming for Corpus Linguistics: How to Do Text Analysis with Java. Edinburgh: Edinburgh University Press.
Meyer, C.F. (2002). English Corpus Linguistics. Cambridge: Cambridge University Press.
McEnery, T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.
Siemens, R. G. (1994). Robert Cawdrey: A Table Alphabetical of Hard Usual English Words (1604). Retrieved from http://www.library.utoronto.ca/utel/ret/cawdrey/cawdrey0.html
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Teubert, W. (2004). 'Language and corpus linguistics'. Lexicology and Corpus Linguistics. London: Continuum.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins Publishing Co.
WordSmith. (2011, October 15). In Wikipedia, The Free Encyclopedia. Retrieved April 22, 2012, from http://en.wikipedia.org/w/index.php?title=WordSmith&oldid=455732307
