Machine Readable Dictionaries and NLP
An examination of the usability of machine-readable dictionaries for NLP, with suggestions for how they might be improved.

Transcript

  • 1. Machine-Readable Dictionaries: Challenges for Lexicography in NLP. LSA Summer 2011, University of Colorado Boulder. Orin Hargraves
  • 2-6. Word Sense Disambiguation (WSD) and Machine-readable Dictionaries (MRDs)
      ● WSD is essential for processing polysemous words in NLP
      ● Machine WSD requires access to a sense inventory for any polysemous word or homographic form
      ● Dictionary databases provide a comprehensive sense inventory (ideally for all words and forms)
      ● Owners of dictionary databases are eager for the income stream that MRDs may offer, in a time of dwindling returns from print dictionaries.
      ● Machine WSD rarely does better than about 60% accuracy
    Concordance sample for “recall” (lines truncated as on the slide):
      then stacks them up in a neat pile.” Recall the reader who dawned his wife’s panty
      livestock farm in western Illinois.” I can recall several Christmases when we were first
      ling, traceability of the food content and recall systems to be of a very high standard
      he changes introduced may be impossible to recall. Personally, genetic modificatio
      contamination" is probably what caused the recall of the taco shells a few months ago. I
      t for its retro plot and visual style that recall the Val Lewton films of the 1940s, it
      scourses, then this article will hopefully recall the prominence of the original texts.
      he toes he got ain't too pretty. I seem to recall a bar fight in Houston. I was in Houst
      er candidates for Governor in California’s recall election. So are an additional 132 peo
      after one airing of Schwarzenegger’s Total Recall, which runs 113 minutes, could require
      titutionally ban same-sex marriage. Recall, also, that the Canadian court could n
      roved corn into the food system. The recall of the StarLink corn is wreaking havoc
  • 7. Where do MRDs come from? ● Most are secondary products from dictionaries intended for human users, in which . . . ● Pertinent entry elements are tagged, and software is developed for access via human or machine query ● No standard protocols exist for conversion of dictionary databases to MRDs ● WordNet(s) are unique in being “purpose built” MRDs – though they contain mainly human-friendly, conventional definitions
  • 8. MRDs: For and Against
    For:
      ● Thousands of hours of work are already done
      ● Lexicographer input constitutes expert WSD
      ● Dictionary databases may contain not only definitions but features useful for WSD like collocations, idioms, spelling variants, inflections and synonymies
      ● Ready-made sense inventory
      ● Some MRDs are free!
    Against:
      ● There is wide disparity in sense division among dictionaries
      ● Even single dictionaries show little ontological consistency
      ● Nearly all dictionaries display circularity among some word families and synsets
      ● Sense inventories do not reflect actual usage
      ● Definitions assume human knowledge
  • 9. Disparities in sense inventories (bug n)
    RHUD:
      1. Also called true bug, hemipteran, hemipteron. a hemipterous insect.
      2. (loosely) any insect or insectlike invertebrate.
      3. Informal. any microorganism, esp. a virus: He was laid up for a week by an intestinal bug.
      4. Informal. a defect or imperfection, as in a mechanical device, computer program, or plan; glitch: The test flight discovered the bugs in the new plane.
      5. Informal. a. a person who has a great enthusiasm for something; fan or hobbyist: a hi-fi bug. b. a craze or obsession: He's got the sports-car bug.
      6. Informal. a. a hidden microphone or other electronic eavesdropping device.
      7. Horse Racing. the five-pound weight allowance that can be claimed by an apprentice jockey. (etc.)
    MW11:
      1 a : an insect or other creeping or crawling invertebrate (as a spider or centipede) b : any of several insects (as the bedbug or cockroach) commonly considered obnoxious c : any of an order (Hemiptera and especially its suborder Heteroptera) of insects that have sucking mouthparts, forewings thickened at the base, and incomplete metamorphosis and are often economic pests — called also true bug
      2 : an unexpected defect, fault, flaw, or imperfection <the software was full of bugs>
      3 a : a germ or microorganism especially when causing disease b : an unspecified or nonspecific sickness usually presumed due to a bug
      4 : a sudden enthusiasm
      5 : ENTHUSIAST <a camera bug>
      6 : a prominent person
      7 : a crazy person
      8 : a concealed listening device
      9 : a weight allowance given apprentice jockeys
    NOAD:
      1 a small insect. ■ informal a harmful microorganism, as a bacterium or virus. ■ an illness caused by such a microorganism: suffering from a flu bug ■ [with adjective] figurative, informal an enthusiastic, almost obsessive, interest in something: they caught the sailing bug | Joe was bitten by the showbiz bug.
      2 (also true bug) Entomology an insect of a large order distinguished by having mouthparts that are modified for piercing and sucking. • Order Hemiptera: see HEMIPTERA.
      3 a miniature microphone, typically concealed in a room or telephone, used for surveillance.
      4 an error in a computer program or system.
  • 10. Lack of Ontological Consistency
  • 11. Lack of Ontological Consistency chamberpot bedpan
  • 12. Lack of Ontological Consistency
    [Two tables; the marks showing which dictionary uses which genus term did not survive extraction.]
    Dictionaries compared (columns): MW11, CED, RHUD, ODE, AHD, Wik, EWED, MED, CACD, WU, WN, Cent
    a bedpan is a . . . (rows): vessel, toilet pan, receptacle, chamber pot, container, pan, utensil
    a chamberpot is a . . . (rows): vessel, toilet pan, receptacle, chamber pot, container, pan, bowl
  • 13. Circularity (here, in generic terms)
    Each entry reads “a [term] is a kind of [genus term]”:
      MW11: receptacle → container; container → receptacle; vessel → container; utensil → (…) vessel
      CED: receptacle → object; container → object; vessel → object; utensil → (…) container
      RHUD: receptacle → container; container → (anything that contains); vessel → utensil; utensil → (…) vessel
      ODE: receptacle → object; container → object; vessel → container; utensil → (…) container
      AHD: receptacle → container; container → receptacle; vessel → utensil; utensil → (…) container
      WordNet: receptacle → container; container → instrumentality; vessel → container; utensil → implement
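The circularity on this slide can be checked mechanically by following each genus-term link until a term repeats. The sketch below is only an illustration: the `GENUS` dictionary is a hand transcription of two columns from the slide, the function name `find_cycle` is made up here, and reading "utensil (…) vessel" as a link from utensil to vessel is an assumption.

```python
# Genus-term ("is a kind of") links for two of the dictionaries on the slide.
# Hand-transcribed; treating "(…) vessel" as a plain link to "vessel" is an assumption.
GENUS = {
    "MW11": {
        "receptacle": "container",
        "container": "receptacle",
        "vessel": "container",
        "utensil": "vessel",
    },
    "WordNet": {
        "receptacle": "container",
        "container": "instrumentality",
        "vessel": "container",
        "utensil": "implement",
    },
}


def find_cycle(links, start):
    """Follow genus links from `start`.

    Returns the loop as a list of terms (first term repeated at the end)
    if the chain runs into a term it has already visited, else None.
    Terms with no entry in `links` end the chain.
    """
    path = []
    term = start
    while term in links:
        if term in path:
            return path[path.index(term):] + [term]
        path.append(term)
        term = links[term]
    return None


for name, links in GENUS.items():
    for term in links:
        print(name, term, find_cycle(links, term))
```

With this transcription, every MW11 chain ends in the receptacle/container loop, while the WordNet chains terminate at terms outside the table.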
  • 14. Sense Inventories Don't Reflect Usage
    Concordance sample for “miscreant” (lines truncated as on the slide):
      " Yes. The stare. The laser look the jut-jawed coach shoots any UT miscreant who plays lackadaisically or stupidly. " I never really get past the eyes,
      advice to all of those doubting academic highbrows out there. To quote that animated miscreant Bart Simpson, " Don't have a cow, man! " This actually
      merely of wayfarers but of entire intellectual traditions. # The name of this huge miscreant is Critical Thinking - a name uttered by professors and students with more awe than
      , Winnipeg, Manitoba, Canada # A: American Express is not the only miscreant here. We have received several letters just like yours about credit-card companies. In
      ARS Western Regional Research Center in Albany, California. Conquering Caulerpa -- A Marine Miscreant # Sometimes referred to as " killer algae, " C. taxifolia flourishes in warm
      China's state-directed economy without a fully convertible currency while lambasting Japan as an economic miscreant. This downgrading of U.S.-Japan ties is particularly painful because it violates the highest virtue
      proliferating cells. It turns out that, in at least some cases, a miscreant protein traps p53, explains Princeton's Levine. P53 can't get anywhere near
      threat to the notions of causality which underlie our understanding of the universe. The miscreant tachyon velocities, Paul Birch proposed in 1984, may be ruled out by some
      the need for Western aid. # Incidentally, " The Ukraine " is a miscreant phrase from the days of the Czarist Empire. Ukraine is a recognized independent nation
      , " Martin wrote later, " which is a pit stop for the average miscreant. But when we passed his office, I realized that I wasn't average
  • 15. Human Knowledge Required!
      ● bedpan: a necessary utensil for the use of persons confined to bed (Century)
      ● clipping: something cut out or trimmed off, esp. an article from a newspaper (CED)
      ● draw vt: extract (an object or liquid) from a container or receptacle (NOAD)
      ● hide: to put someone or something in a place where they cannot be seen or found, or to put yourself somewhere where you cannot be seen or found (CDAE)
      ● mangle: to spoil, injure, or make incoherent especially through ineptitude (MW11)
      ● restaurant: a building where people go to eat (WordNet)
      ● spade: 2. some implement, piece, or part resembling this (RHUD)
      ● drop: 5b. mention in passing, typically in order to impress (ODE)
      ● bombshell: 1. an unexpected and surprising event, especially an unpleasant one (ODE)
  • 16. A Core Problem: Lumping and Splitting ● Humans split lumpiness automatically (by discarding nonsense and impertinent information) ● Computers are largely clueless as to what is nonsense, and where logic limits lumpiness ● Splitty, very specific definitions are easier for machines to identify, however . . . ● They're irritating to humans, and take up much more space (and processing time)
  • 17. An alternative to MRDs: MWDs (Machine-Written Dictionaries) ● MRD “entries” can be supplemented with machine-harvested, machine-readable data ● Human-friendly lumpiness can be mitigated with the addition of disambiguating features ● Other inputs can support “human knowledge” and flesh out the implicit parts of definitions ● Corpus data can identify sense inventory gaps ● Many “gold standard” inputs are readily available and underexploited
  • 18. Just the Word: collocational data
  • 19. Just the Word ● URL: www.just-the-word.com ● Data Owner: Sharp Laboratories ● Underlying Data: British National Corpus ● Main purpose: catalog of collocational patterns ● Possibly useful for: extraction of most frequent collocations and bigrams; some bigrams and triples not collected by Word Sketches (see next)
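As a rough picture of what "extraction of most frequent collocations and bigrams" involves, the sketch below counts adjacent word pairs in a few sentences. It is not how Just the Word works (that service groups combinations by grammatical relation over the BNC); the function name `bigram_counts` and the `SAMPLE` sentences are illustrative assumptions, with only the first sentence taken from the "recall" concordance earlier in the deck.

```python
import re
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent lowercase word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # zip pairs each token with its right-hand neighbour
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy input: the first sentence comes from the concordance sample,
# the other two are invented for illustration.
SAMPLE = [
    "I seem to recall a bar fight in Houston.",
    "I can recall several Christmases.",
    "I seem to recall the taco shells.",
]

counts = bigram_counts(SAMPLE)
for pair, n in counts.most_common(5):
    print(pair, n)
```

Raw adjacency counts like these say nothing about sense. They would only be a first filter before the grammatical and sense-level analysis the later slides describe.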
  • 20. Word Sketches
  • 21. Word Sketches ● URL: www.sketchengine.co.uk ● Data Licenser: Lexical Computing, Ltd. ● Underlying Data: numerous corpora ● Main purpose: aid to lexicographers and researchers ● Possibly useful for: statistical profiling of sense frequency; identification of idioms, phrasal verbs, compounds, and other “chunks”. Corpus Query Language allows for extensive flexibility in data retrieval.
  • 22. Oxford Sentence Dictionary
  • 23. Oxford Sentence Dictionary ● URL: http://dws-sketch.uk.oup.com (?) ● Data Owner: Oxford University Press ● Underlying Data: Oxford English Corpus and World Wide Web ● Main purpose: collection of example sentences for ESL and other purposes ● Possibly useful for: Lesk-like approach to WSD; sense identification by pattern matching.
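A "Lesk-like approach" scores each candidate sense by how many words its definition (or its example sentences) shares with the context of the target word. The sketch below is a simplified version of that idea under stated assumptions: the sense labels, the `STOPWORDS` list, and the function names are invented here, and the glosses are the NOAD definitions of bug quoted on slide 9 rather than Oxford Sentence Dictionary data.

```python
import re

# Small hand-picked stopword list; an assumption for this sketch.
STOPWORDS = {"a", "an", "the", "in", "of", "or", "to", "for", "by",
             "such", "is", "as", "used", "they", "was"}

# Sense signatures for "bug", taken from the NOAD column on slide 9.
# The labels on the left are invented shorthand.
BUG_SENSES = {
    "insect": "a small insect",
    "illness": "an illness caused by such a microorganism: suffering from a flu bug",
    "microphone": "a miniature microphone, typically concealed in a room "
                  "or telephone, used for surveillance",
    "error": "an error in a computer program or system",
}


def content_words(text):
    """Lowercase word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(context, senses, target):
    """Return (best_label, scores), scoring each sense by word overlap.

    Ties (including all-zero scores) fall to the first sense listed.
    """
    ctx = content_words(context) - {target}
    scores = {label: len(ctx & content_words(sig)) for label, sig in senses.items()}
    best = max(scores, key=scores.get)
    return best, scores


print(simplified_lesk("They found a bug concealed in the telephone.", BUG_SENSES, "bug"))
print(simplified_lesk("the software was full of bugs", BUG_SENSES, "bug"))
```

The first context shares "concealed" and "telephone" with the microphone gloss and resolves to it. The second, MW11's own example for the software sense, shares no content word with any gloss, so the choice falls back to the first sense listed. That illustrates why richer signatures, such as large sets of example sentences, are attractive for this kind of matching.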
  • 24. FrameNet (via FrameNet Explorer)
  • 25. FrameNet ● URL: http://framenet.icsi.berkeley.edu/ ● Data Owner: UC Berkeley ● Underlying Data: BNC and other ● Main purposes: manifold ● Possibly useful for: complementary to Sketch Engine and OSD ● Further reading: “The Contribution of FrameNet to Practical Lexicography,” Atkins et al, IJL 16:3 (2003)
  • 26. Disambiguation of Collocations
  • 27. Disambiguation of Collocations ● 90% of V+N and N+V collocations resolve to a single sense for each ● 10% of these represent multiple senses and require further context to disambiguate, e.g. bring case [V* obj N]: I didn't have enough evidence to bring the case to court. letter refer [N* subj V]: The letters 'c' and 'x' refer to the dilution factor used.
  • 28. Disambiguation of Collocations ● 90% of V+N and N+V collocations resolve to a single sense for each ● 10% of these represent multiple senses and require further context to disambiguate, e.g. bring case [V* obj N]: I didn't have enough evidence to bring the case to court. Her husband Ian brought a case of wine and a box of glasses. letter refer [N* subj V]: The letters 'c' and 'x' refer to the dilution factor used. Michael's letter refers very frequently to 'export-oriented consumed-productivity standards.'
  • 29. Disambiguation of Collocations ● URL: none (data is not currently online) ● Data Owner: University of Rome, La Sapienza (Roberto Navigli) ● Underlying Data: BNC and Just the Word; WordNet ● Main purpose: disambiguation of collocations ● Possibly useful for: a bigram dictionary (N+V and V+N only)
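One way to picture the "bigram dictionary" mentioned here is a lookup table keyed by verb, grammatical relation, and noun, where most keys carry a single sense pairing and a minority carry several. The sketch below assumes that shape. The actual data is not online, so the structure, the sense labels, and the single-sense entry for "sign letter" are hypothetical; only the "bring case" and "letter refer" pairs come from the slides.

```python
# Hypothetical bigram dictionary: (verb, relation, noun) -> possible sense pairings.
# Sense labels are invented shorthand, not identifiers from WordNet or the real data.
COLLOCATIONS = {
    ("bring", "obj", "case"): [
        ("bring/initiate", "case/lawsuit"),    # "bring the case to court"
        ("bring/carry", "case/container"),     # "brought a case of wine"
    ],
    ("refer", "subj", "letter"): [
        ("refer/denote", "letter/character"),  # "the letters 'c' and 'x' refer to ..."
        ("refer/mention", "letter/message"),   # "Michael's letter refers ... to ..."
    ],
    ("sign", "obj", "letter"): [
        ("sign/write-name", "letter/message"), # hypothetical single-sense entry
    ],
}


def lookup(verb, relation, noun):
    """Return (status, readings) for a verb-noun collocation.

    status is "resolved" when exactly one sense pairing is listed,
    "needs context" when several are, and "unknown" when the
    collocation is not in the dictionary.
    """
    readings = COLLOCATIONS.get((verb, relation, noun), [])
    if len(readings) == 1:
        return "resolved", readings
    if readings:
        return "needs context", readings
    return "unknown", readings


print(lookup("sign", "obj", "letter"))
print(lookup("bring", "obj", "case"))
```

Entries that come back as "needs context" correspond to the roughly 10% of collocations that the slides say require further context to disambiguate.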
  • 30. The Century Dictionary
  • 31. The Century Dictionary ● URL: http://www.global-language.com/CENTURY/ and http://www.archive.org/details/centurydictionar11wh ● Data Owner: public domain ● Underlying data: late 19th century English ● Main purpose: “a work of universal reference in all departments of knowledge” ● Possibly useful for: same!
  • 32. Thanks! www.orinhargraves.com
