Machine-Readable Dictionaries
Challenges for Lexicography in NLP
LSA Summer 2011
University of Colorado Boulder
Orin Hargraves
Word Sense Disambiguation (WSD) and
Machine-readable Dictionaries (MRDs)
● WSD is essential for processing polysemous words in NLP
then stacks them up in a neat pile.”</p><p>Recall the reader who dawned his wife’s panty
livestock farm in western Illinois.” I can recall several Christmases when we were first
ling, traceability of the food content and recall systems to be of a very high standard
he changes introduced may be impossible to recall.</p><p>Personally, genetic modificatio
contamination" is probably what caused the recall of the taco shells a few months ago. I
t for its retro plot and visual style that recall the Val Lewton films of the 1940s, it
scourses, then this article will hopefully recall the prominence of the original texts.
he toes he got ain't too pretty. I seem to recall a bar fight in Houston. I was in Houst
er candidates for Governor in California’s recall election. So are an additional 132 peo
after one airing of Schwarzenegger’s Total Recall, which runs 113 minutes, could require
titutionally ban same-sex marriage.</p><p> Recall, also, that the Canadian court could n
roved corn into the food system.</p><p>The recall of the StarLink corn is wreaking havoc
● Machine WSD requires access to a sense inventory for any
polysemous word or homographic form
● Dictionary databases provide a comprehensive sense
inventory (ideally for all words and forms)
● Owners of dictionary databases are eager for the income
stream that MRDs may offer, in a time of dwindling returns
from print dictionaries.
● Machine WSD rarely does better than about 60% accuracy
Where do MRDs come from?
● Most are secondary products from dictionaries
intended for human users, in which . . .
● Pertinent entry elements are tagged, and
software is developed for access via human or
machine query
● No standard protocols exist for conversion of
dictionary databases to MRDs
● WordNet(s) are unique in being “purpose built”
MRDs – though they contain mainly human-
friendly, conventional definitions
MRDs: For and Against
For:
– Thousands of hours of work are already
● Lexicographer input constitutes expert WSD
● Dictionary databases may contain not only definitions but features useful for WSD like collocations, idioms, spelling variants, inflections and synonymies
● Ready-made sense inventory
● Some MRDs are free!
Against:
– There is wide disparity in sense division among
● Even single dictionaries show little ontological consistency
● Nearly all dictionaries display circularity among some word families and synsets
● Sense inventories do not reflect actual usage
● Definitions assume human
Disparities in sense inventories (bug n)
1. Also called true bug, hemipteran,
hemipteron. a hemipterous insect.
2. (loosely) any insect or insectlike
3. Informal. any microorganism, esp.
a virus: He was laid up for a week by
an intestinal
4. Informal. a defect or imperfection,
as in a mechanical device, computer
program, or
plan; glitch: The test flight
discovered the bugs in the new
5. Informal. a. a person who has a
great enthusiasm for something; fan
or hobbyist: a hi-fi bug.
b. a craze or obsession: He's got the
sports-car bug.
6. Informal. a. a hidden microphone
or other electronic eavesdropping
7. Horse Racing. the five-pound
weight allowance that can be
claimed by an apprentice
1 a : an insect or other creeping or
crawling invertebrate (as a spider or
centipede) b : any of several insects
(as the bedbug or cockroach)
commonly considered obnoxious c :
any of an order (Hemiptera and
especially its suborder Heteroptera)
of insects that have sucking
mouthparts, forewings thickened at
the base, and incomplete
metamorphosis and are often
economic pests — called also true
2 : an unexpected defect, fault, flaw,
or imperfection <the software was full
of bugs>
3 a : a germ or microorganism
especially when causing disease b :
an unspecified or nonspecific
sickness usually presumed due to a
4 : a sudden enthusiasm
5 : ENTHUSIAST <a camera bug>
6 : a prominent person
7 : a crazy person
8 : a concealed listening device
9 : a weight
allowance given apprentice jockeys
1 a small insect. ■ informal a harmful
microorganism, as a bacterium or virus. ■
an illness caused by such a microorganism:
suffering from a flu bug ■ [with adjective]
figurative, informal an enthusiastic, almost
obsessive, interest in something: they
caught the sailing bug | Joe was bitten by
the showbiz bug.
2 (also true bug) Entomology an insect of a
large order distinguished by having
mouthparts that are modified for piercing
and sucking. •Order Hemiptera: see
3 a miniature microphone, typically
concealed in a room or telephone, used for
4 an error in a computer program or
Lack of Ontological Consistency
a bedpan is a . . .
vessel   
toilet pan 
receptacle  
chamber pot 
container   
pan 
utensil 
a chamberpot is a . . .
vessel    
toilet pan
receptacle 
container  
bowl   
Circularity (here, in generic terms)
a            is a kind of
receptacle   container
container    receptacle
vessel       container
utensil      (…) vessel

a            is a kind of
receptacle   object
container    object
vessel       object
utensil      (…) container

a            is a kind of
receptacle   container
container    (anything that contains)
vessel       utensil
utensil      (…) vessel

a            is a kind of
receptacle   object
container    object
vessel       container
utensil      (…) container

a            is a kind of
receptacle   container
container    receptacle
vessel       utensil
utensil      (…) container

a            is a kind of
receptacle   container
container    instrumentality
vessel       container
utensil      implement
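The circularity shown in these tables can be checked mechanically. The sketch below is only an illustration: it treats each table as a mapping from a headword to the genus term of its definition and follows the "is a kind of" links until a term repeats. The function name find_cycle and the variable names are assumptions made for this example; the two mappings copy the first and second tables above.

```python
def find_cycle(genus, start):
    """Follow 'is a kind of' links from start.

    Returns the loop as a list of terms if a term repeats,
    or None if the chain ends at a term with no entry.
    """
    path = []   # terms visited, in order
    seen = {}   # term -> position in path
    term = start
    while term in genus:
        if term in seen:
            # We have come back to an earlier term: report the loop.
            return path[seen[term]:] + [term]
        seen[term] = len(path)
        path.append(term)
        term = genus[term]
    return None

# First table above: receptacle and container define each other.
circular = {
    "receptacle": "container",
    "container": "receptacle",
    "vessel": "container",
    "utensil": "vessel",
}

# Second table above: every chain ends at "object", which has no entry here.
acyclic = {
    "receptacle": "object",
    "container": "object",
    "vessel": "object",
    "utensil": "container",
}

print(find_cycle(circular, "utensil"))
print(find_cycle(acyclic, "utensil"))
```

A real dictionary gives several genus terms per sense, so a fuller version would need a graph with multiple outgoing links per headword; the single-link mapping here only mirrors the simplified tables.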
Sense Inventories Don't Reflect Usage
" Yes. The stare. The laser look the jut-jawed coach shoots any UT miscreant
who plays lackadaisically or stupidly. " I never really get past the eyes,
advice to all of those doubting academic highbrows out there. To quote that
animated miscreant Bart Simpson, " Don't have a cow, man! " This actually
merely of wayfarers but of entire intellectual traditions. # The name of
this huge miscreant is Critical Thinking - a name uttered by professors and
students with more awe than
, Winnipeg, Manitoba, Canada # A: American Express is not the only miscreant
here. We have received several letters just like yours about credit-card
companies. In
ARS Western Regional Research Center in Albany, California. Conquering
Caulerpa -- A Marine Miscreant # Sometimes referred to as " killer algae, " C.
taxifolia flourishes in warm
China's state-directed economy without a fully convertible currency while
lambasting Japan as an economic miscreant. This downgrading of U.S.-Japan ties
is particularly painful because it violates the highest virtue
proliferating cells. It turns out that, in at least some cases, a miscreant
protein traps p53, explains Princeton's Levine. P53 can't get anywhere near
threat to the notions of causality which underlie our understanding of the
universe. The miscreant tachyon velocities, Paul Birch proposed in 1984, may
be ruled out by some
the need for Western aid. # Incidentally, " The Ukraine " is a miscreant
phrase from the days of the Czarist Empire. Ukraine is a recognized
independent nation
, " Martin wrote later, " which is a pit stop for the average miscreant. But
when we passed his office, I realized that I wasn't average
Human Knowledge Required!
bedpan a necessary utensil for the use of persons confined to bed (Century)
clipping something cut out or trimmed off, esp. an article from a newspaper
draw vt extract (an object or liquid) from a container or receptacle (NOAD)
hide to put someone or something in a place where they cannot be seen or
found, or to put yourself somewhere where you cannot be seen or found (CDAE)
mangle to spoil, injure, or make incoherent especially through ineptitude
restaurant a building where people go to eat (WordNet)
spade 2. some implement, piece, or part resembling this (RHUD)
drop 5b. mention in passing, typically in order to impress (ODE)
bombshell 1. an unexpected and surprising event, especially an unpleasant
one (ODE)
A Core Problem:
Lumping and Splitting
● Humans split lumpiness automatically (by
discarding nonsense and impertinent
● Computers are largely clueless as to what is
nonsense, and where logic limits lumpiness
● Splitty, very specific definitions are easier for
machines to identify, however . . .
● They're irritating to humans, and take up much
more space (and processing time)
An alternative to MRDs: WMDs MWDs
Machine-Written Dictionaries
● MRD “entries” can be supplemented with
machine-harvested, machine-readable data
● Human-friendly lumpiness can be mitigated
with the addition of disambiguating features
● Other inputs can support “human knowledge”
and flesh out the implicit parts of definitions
● Corpus data can identify sense inventory gaps
● Many “gold standard” inputs are readily
available and underexploited
Just the Word: collocational data
Just the Word
● URL:
● Data Owner: Sharp Laboratories
● Underlying Data: British National Corpus
● Main purpose: catalog of collocational patterns
● Possibly useful for: extraction of most frequent
collocations and bigrams; some bigrams and
triples not collected by Word Sketches (see
Word Sketches
● URL:
● Data Licenser: Lexical Computing, Ltd.
● Underlying Data: numerous corpora
● Main purpose: aid to lexicographers and
● Possibly useful for: statistical profiling of sense
frequency; identification of idioms, phrasal
verbs, compounds, and other “chunks”. Corpus
Query Language allows for extensive flexibility
in data retrieval.
Oxford Sentence Dictionary
● URL: (?)
● Data Owner: Oxford University Press
● Underlying Data: Oxford English Corpus and
World Wide Web
● Main purpose: collection of example sentences
for ESL and other purposes
● Possibly useful for: Lesk-like approach to WSD;
sense identification by pattern matching.
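As a rough illustration of what a "Lesk-like approach" could look like, the sketch below picks the sense whose definition and example text share the most words with the context of the target word. It is a simplified overlap count, not the Oxford Sentence Dictionary's data or interface. The glosses are borrowed from the "bug" entries earlier in this deck; the sense labels, the stopword list, and the names tokens and lesk are assumptions made for this example.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "or", "in", "to", "and", "was", "is",
             "by", "for", "that", "as", "he", "she", "it", "when"}

def tokens(text):
    """Lowercase the text and return its set of non-stopword word forms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def lesk(context, senses, target):
    """Return the sense label with the largest word overlap, or None if no overlap."""
    ctx = tokens(context) - {target}
    best, best_overlap = None, 0
    for label, signature in senses.items():
        overlap = len(ctx & tokens(signature))
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best

# Glosses taken from the "bug" entries above; the labels are hypothetical.
SENSES = {
    "defect": "an unexpected defect, fault, flaw, or imperfection; "
              "the software was full of bugs",
    "listening_device": "a concealed listening device",
    "microorganism": "a germ or microorganism especially when causing disease",
}

print(lesk("The bug in the software caused a fault", SENSES, "bug"))
print(lesk("They found a listening device hidden in the embassy", SENSES, "bug"))
```

Adding more example sentences to each sense's signature gives the overlap count more to work with, which is where a large collection of sense-tagged sentences would help.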
FrameNet (via FrameNet Explorer)
● URL:
● Data Owner: UC Berkeley
● Underlying Data: BNC and other
● Main purposes: manifold
● Possibly useful for: complementary to Sketch
Engine and OSD
● Further reading: “The Contribution of FrameNet
to Practical Lexicography,” Atkins et al, IJL 16:3
Disambiguation of Collocations
● 90% of V+N and N+V collocations resolve to a single sense for
● 10% of these represent multiple senses and require further
context to disambiguate, e.g.
bring case [V* obj N]:
I didn't have enough evidence to bring the case to court.
Her husband Ian brought a case of wine and a box of
letter refer [N* subj V]:
The letters 'c' and 'x' refer to the dilution factor used.
Michael's letter refers very frequently to 'export-oriented
consumed-productivity standards.'
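The two cases above suggest a two-step lookup: a collocation with one recorded sense resolves immediately, and an ambiguous one falls back on further context. The sketch below is an assumption-laden illustration, not the data or method of the resource described in this deck: the sense labels, the cue words (drawn from the example sentences above), the single-sense "drop name" entry, and the names COLLOCATIONS and disambiguate are all hypothetical.

```python
import re

# (verb, noun) -> {sense label: cue words that point to that sense}.
# All labels and cue words are illustrative assumptions.
COLLOCATIONS = {
    ("bring", "case"): {
        "legal_action": {"court", "evidence"},
        "container": {"wine", "box"},
    },
    # Hypothetical single-sense entry: no further context is needed.
    ("drop", "name"): {
        "mention_in_passing": set(),
    },
}

def disambiguate(verb, noun, sentence):
    """Return a sense label for the collocation, or None if it cannot be resolved."""
    senses = COLLOCATIONS.get((verb, noun))
    if senses is None:
        return None                    # collocation not in the dictionary
    if len(senses) == 1:
        return next(iter(senses))      # the common case: a single sense
    # Ambiguous collocation: count cue words found in the wider sentence.
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {label: len(cues & words) for label, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("bring", "case",
                   "I didn't have enough evidence to bring the case to court."))
print(disambiguate("bring", "case", "Her husband Ian brought a case of wine"))
```

Ties between senses go to whichever is listed first, which is a shortcut; a real system would want a principled tie-break or would decline to answer.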
Disambiguation of Collocations
● URL: none (data is not currently online)
● Data Owner: University of Rome, La Sapienza
(Roberto Navigli)
● Underlying Data: BNC and Just the Word;
● Main purpose: disambiguation of collocations
● Possibly useful for: a bigram dictionary (N+V
and V+N only)
The Century
The Century Dictionary
● URL:
● Data Owner: public domain
● Underlying Data: late 19th century English
● Main purpose: “a work of universal reference in
all departments of knowledge”
● Possibly useful for: same!

Machine Readable Dictionaries and NLP

  • 1. Machine-Readable Dictionaries Challenges for Lexicography in NLP LSA Summer 2011 University of Colorado Boulder Orin Hargraves
  • 2. Word Sense Disambiguation (WSD) and Machine-readable Dictionaries (MRDs)
    ● WSD is essential for processing polysemous words in NLP
        then stacks them up in a neat pile.” Recall the reader who dawned his wife’s panty
        livestock farm in western Illinois.” I can recall several Christmases when we were first
        ling, traceability of the food content and recall systems to be of a very high standard
        he changes introduced may be impossible to recall. Personally, genetic modificatio
        contamination" is probably what caused the recall of the taco shells a few months ago. I
        t for its retro plot and visual style that recall the Val Lewton films of the 1940s, it
        scourses, then this article will hopefully recall the prominence of the original texts.
        he toes he got ain't too pretty. I seem to recall a bar fight in Houston. I was in Houst
        er candidates for Governor in California’s recall election. So are an additional 132 peo
        after one airing of Schwarzenegger’s Total Recall, which runs 113 minutes, could require
        titutionally ban same-sex marriage. Recall, also, that the Canadian court could n
        roved corn into the food system. The recall of the StarLink corn is wreaking havoc
    ● Machine WSD requires access to a sense inventory for any polysemous word or homographic form
    ● Dictionary databases provide a comprehensive sense inventory (ideally for all words and forms)
    ● Owners of dictionary databases are eager for the income stream that MRDs may offer, in a time of dwindling returns from print dictionaries.
    ● Machine WSD rarely does better than about 60% accuracy
  • 7. Where do MRDs come from? ● Most are secondary products from dictionaries intended for human users, in which . . . ● Pertinent entry elements are tagged, and software is developed for access via human or machine query ● No standard protocols exist for conversion of dictionary databases to MRDs ● WordNet(s) are unique in being “purpose built” MRDs – though they contain mainly human-friendly, conventional definitions
  • 8. MRDs: For and Against
    For:
    ● Thousands of hours of work are already done
    ● Lexicographer input constitutes expert WSD
    ● Dictionary databases may contain not only definitions but features useful for WSD like collocations, idioms, spelling variants, inflections and synonymies
    ● Ready-made sense inventory
    ● Some MRDs are free!
    Against:
    ● There is wide disparity in sense division among dictionaries
    ● Even single dictionaries show little ontological consistency
    ● Nearly all dictionaries display circularity among some word families and synsets
    ● Sense inventories do not reflect actual usage
    ● Definitions assume human knowledge
  • 9. Disparities in sense inventories (bug n)
    RHUD:
      1. Also called true bug, hemipteran, hemipteron. a hemipterous insect.
      2. (loosely) any insect or insectlike invertebrate.
      3. Informal. any microorganism, esp. a virus: He was laid up for a week by an intestinal bug.
      4. Informal. a defect or imperfection, as in a mechanical device, computer program, or plan; glitch: The test flight discovered the bugs in the new plane.
      5. Informal. a. a person who has a great enthusiasm for something; fan or hobbyist: a hi-fi bug. b. a craze or obsession: He's got the sports-car bug.
      6. Informal. a. a hidden microphone or other electronic eavesdropping device.
      7. Horse Racing. the five-pound weight allowance that can be claimed by an apprentice jockey. (etc.)
    MW11:
      1 a : an insect or other creeping or crawling invertebrate (as a spider or centipede) b : any of several insects (as the bedbug or cockroach) commonly considered obnoxious c : any of an order (Hemiptera and especially its suborder Heteroptera) of insects that have sucking mouthparts, forewings thickened at the base, and incomplete metamorphosis and are often economic pests — called also true bug
      2 : an unexpected defect, fault, flaw, or imperfection <the software was full of bugs>
      3 a : a germ or microorganism especially when causing disease b : an unspecified or nonspecific sickness usually presumed due to a bug
      4 : a sudden enthusiasm
      5 : ENTHUSIAST <a camera bug>
      6 : a prominent person
      7 : a crazy person
      8 : a concealed listening device
      9 : a weight allowance given apprentice jockeys
    NOAD:
      1 a small insect. ■ informal a harmful microorganism, as a bacterium or virus. ■ an illness caused by such a microorganism: suffering from a flu bug ■ [with adjective] figurative, informal an enthusiastic, almost obsessive, interest in something: they caught the sailing bug | Joe was bitten by the showbiz bug.
      2 (also true bug) Entomology an insect of a large order distinguished by having mouthparts that are modified for piercing and sucking. •Order Hemiptera: see HEMIPTERA.
      3 a miniature microphone, typically concealed in a room or telephone, used for surveillance.
      4 an error in a computer program or system.
  • 10. Lack of Ontological Consistency
  • 11. Lack of Ontological Consistency chamberpot bedpan
  • 12. Lack of Ontological Consistency
    [Two comparison tables; the per-dictionary marks did not survive extraction.]
    Columns (both tables): MW11, CED, RHUD, ODE, AHD, Wik, EWED, MED, CACD, WU, WN, Cent
    a bedpan is a . . . (rows): vessel; toilet pan; receptacle; chamber pot; container; pan; utensil
    a chamberpot is a . . . (rows): vessel; toilet pan; receptacle; chamber pot; container; pan; bowl
  • 13. Circularity (here, in generic terms)
    Each entry reads "a X is a kind of Y", written here as X → Y.
    MW11: receptacle → container; container → receptacle; vessel → container; utensil → (…) vessel
    CED: receptacle → object; container → object; vessel → object; utensil → (…) container
    RHUD: receptacle → container; container → (anything that contains); vessel → utensil; utensil → (…) vessel
    ODE: receptacle → object; container → object; vessel → container; utensil → (…) container
    AHD: receptacle → container; container → receptacle; vessel → utensil; utensil → (…) container
    WordNet: receptacle → container; container → instrumentality; vessel → container; utensil → implement
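The circularity on this slide can be checked mechanically by following each "is a kind of" link until a term repeats. The sketch below is only an illustration: the links are transcribed from the slide, and where the slide shows "(…) Y" the code keeps only Y, which is a simplifying assumption. The function names are invented for this example.

```python
# Genus links transcribed from the slide ("a X is a kind of Y").
# Assumption: "(…) Y" is reduced to Y; RHUD's "container" has no genus term
# ("anything that contains"), so it is left out of that dictionary's links.
GENUS = {
    "MW11": {"receptacle": "container", "container": "receptacle",
             "vessel": "container", "utensil": "vessel"},
    "CED": {"receptacle": "object", "container": "object",
            "vessel": "object", "utensil": "container"},
    "RHUD": {"receptacle": "container", "vessel": "utensil",
             "utensil": "vessel"},
    "ODE": {"receptacle": "object", "container": "object",
            "vessel": "container", "utensil": "container"},
    "AHD": {"receptacle": "container", "container": "receptacle",
            "vessel": "utensil", "utensil": "container"},
    "WordNet": {"receptacle": "container", "container": "instrumentality",
                "vessel": "container", "utensil": "implement"},
}


def find_cycle(links, start):
    """Follow genus links from `start`; return the loop if a term repeats, else None."""
    path = []
    term = start
    while term in links:
        if term in path:
            # The loop runs from the first visit of `term` back to `term`.
            return path[path.index(term):] + [term]
        path.append(term)
        term = links[term]
    return None


def dictionaries_with_cycles(genus=GENUS):
    """Return the sorted names of dictionaries in which some term reaches a loop."""
    names = []
    for name, links in genus.items():
        if any(find_cycle(links, term) for term in links):
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    print(dictionaries_with_cycles())
    print(find_cycle(GENUS["RHUD"], "vessel"))
```

On these four terms, only the dictionaries whose chains end in a term outside the set (such as "object" or "instrumentality") come out loop-free.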
  • 14. Sense Inventories Don't Reflect Usage
        " Yes. The stare. The laser look the jut-jawed coach shoots any UT miscreant who plays lackadaisically or stupidly. " I never really get past the eyes,
        advice to all of those doubting academic highbrows out there. To quote that animated miscreant Bart Simpson, " Don't have a cow, man! " This actually
        merely of wayfarers but of entire intellectual traditions. # The name of this huge miscreant is Critical Thinking - a name uttered by professors and students with more awe than
        , Winnipeg, Manitoba, Canada # A: American Express is not the only miscreant here. We have received several letters just like yours about credit-card companies. In
        ARS Western Regional Research Center in Albany, California. Conquering Caulerpa -- A Marine Miscreant # Sometimes referred to as " killer algae, " C. taxifolia flourishes in warm
        China's state-directed economy without a fully convertible currency while lambasting Japan as an economic miscreant. This downgrading of U.S.-Japan ties is particularly painful because it violates the highest virtue
        proliferating cells. It turns out that, in at least some cases, a miscreant protein traps p53, explains Princeton's Levine. P53 can't get anywhere near
        threat to the notions of causality which underlie our understanding of the universe. The miscreant tachyon velocities, Paul Birch proposed in 1984, may be ruled out by some
        the need for Western aid. # Incidentally, " The Ukraine " is a miscreant phrase from the days of the Czarist Empire. Ukraine is a recognized independent nation
        , " Martin wrote later, " which is a pit stop for the average miscreant. But when we passed his office, I realized that I wasn't average
  • 15. Human Knowledge Required!
    bedpan: a necessary utensil for the use of persons confined to bed (Century)
    clipping: something cut out or trimmed off, esp. an article from a newspaper (CED)
    draw vt: extract (an object or liquid) from a container or receptacle (NOAD)
    hide: to put someone or something in a place where they cannot be seen or found, or to put yourself somewhere where you cannot be seen or found (CDAE)
    mangle: to spoil, injure, or make incoherent especially through ineptitude (MW11)
    restaurant: a building where people go to eat (WordNet)
    spade 2.: some implement, piece, or part resembling this (RHUD)
    drop 5b.: mention in passing, typically in order to impress (ODE)
    bombshell 1.: an unexpected and surprising event, especially an unpleasant one (ODE)
  • 16. A Core Problem: Lumping and Splitting ● Humans split lumpiness automatically (by discarding nonsense and impertinent information) ● Computers are largely clueless as to what is nonsense, and where logic limits lumpiness ● Splitty, very specific definitions are easier for machines to identify, however . . . ● They're irritating to humans, and take up much more space (and processing time)
  • 17. An alternative to MRDs: WMDs MWDs Machine-Written Dictionaries ● MRD “entries” can be supplemented with machine-harvested, machine-readable data ● Human-friendly lumpiness can be mitigated with the addition of disambiguating features ● Other inputs can support “human knowledge” and flesh out the implicit parts of definitions ● Corpus data can identify sense inventory gaps ● Many “gold standard” inputs are readily available and underexploited
  • 18. Just the Word: collocational data
  • 19. Just the Word ● URL: ● Data Owner: Sharp Laboratories ● Underlying Data: British National Corpus ● Main purpose: catalog of collocational patterns ● Possibly useful for: extraction of most frequent collocations and bigrams; some bigrams and triples not collected by Word Sketches (see next)
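As a rough illustration of what "extraction of most frequent collocations and bigrams" involves, the sketch below counts adjacent word pairs in a text. It is not how Just the Word works internally; the function name is invented here, the sample sentence is made up rather than taken from the British National Corpus, and a real system would add lemmatisation, part-of-speech filtering and an association measure.

```python
import re
from collections import Counter


def top_bigrams(text, n=3):
    """Return the n most frequent adjacent word pairs in `text` with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip(tokens, tokens[1:]) pairs every token with its right-hand neighbour.
    return Counter(zip(tokens, tokens[1:])).most_common(n)


if __name__ == "__main__":
    sample = "bring the case to court and bring the case again"  # invented toy text
    for pair, count in top_bigrams(sample, 2):
        print(pair, count)
```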
  • 21. Word Sketches ● URL: ● Data Licenser: Lexical Computing, Ltd. ● Underlying Data: numerous corpora ● Main purpose: aid to lexicographers and researchers ● Possibly useful for: statistical profiling of sense frequency; identification of idioms, phrasal verbs, compounds, and other “chunks”. Corpus Query Language allows for extensive flexibility in data retrieval.
  • 24. Oxford Sentence Dictionary ● URL: (?) ● Data Owner: Oxford University Press ● Underlying Data: Oxford English Corpus and World Wide Web ● Main purpose: collection of example sentences for ESL and other purposes ● Possibly useful for: Lesk-like approach to WSD; sense identification by pattern matching.
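A "Lesk-like approach" over example sentences could look roughly like the sketch below: each sense is scored by how many content words its example sentences share with the context being disambiguated. This is a simplified variant written for illustration. The sense labels, example sentences and stopword list are all invented here and are not drawn from the Oxford Sentence Dictionary.

```python
import re

# Small hand-picked stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "and", "was", "is", "by",
             "for", "he", "she", "it", "with", "that", "on"}

# Hypothetical sense inventory with invented example sentences.
EXAMPLES = {
    "bug/insect": ["A small bug crawled across the leaf in the garden."],
    "bug/software": ["The programmer fixed a bug in the program code."],
    "bug/microphone": ["Agents hid a bug in the telephone to record the conversation."],
}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def lesk_like(context, target, examples=EXAMPLES):
    """Return (best_sense, scores); best_sense is None when nothing overlaps."""
    ctx = content_words(context) - {target}
    scores = {}
    for sense, sentences in examples.items():
        # Score = shared content words, summed over the sense's example sentences.
        scores[sense] = sum(len(ctx & (content_words(s) - {target})) for s in sentences)
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return (best if scores[best] > 0 else None), scores


if __name__ == "__main__":
    print(lesk_like("The new program crashed because of a bug in the code", "bug"))
```

Raw word overlap is brittle: with so few example sentences per sense, most contexts would score zero, which is one reason a large bank of example sentences is attractive for this approach.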
  • 26. FrameNet ● URL: ● Data Owner: UC Berkeley ● Underlying Data: BNC and other ● Main purposes: manifold ● Possibly useful for: complementary to Sketch Engine and OSD ● Further reading: “The Contribution of FrameNet to Practical Lexicography,” Atkins et al, IJL 16:3 (2003)
  • 29. Disambiguation of Collocations ● 90% of V+N and N+V collocations resolve to a single sense for each ● 10% of these represent multiple senses and require further context to disambiguate, e.g. bring case [V* obj N]: I didn't have enough evidence to bring the case to court. Her husband Ian brought a case of wine and a box of glasses. letter refer [N* subj V]: The letters 'c' and 'x' refer to the dilution factor used. Michael's letter refers very frequently to 'export-oriented consumed-productivity standards.'
  • 30. Disambiguation of Collocations ● URL: none (data is not currently online) ● Data Owner: University of Rome, La Sapienza (Roberto Navigli) ● Underlying Data: BNC and Just the Word; WordNet ● Main purpose: disambiguation of collocations ● Possibly useful for: a bigram dictionary (N+V and V+N only)
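One way to picture a bigram dictionary of this kind is a lookup table keyed on the word pair and its grammatical relation, returning either a single sense pairing or a list of candidates that still need wider context. The sketch below is hypothetical throughout: the data is not online, so the table layout, the sense labels and the "drink wine" entry are invented for illustration, with "bring case" and "letter refer" modelled on the ambiguous examples in the slides.

```python
# Hypothetical bigram dictionary: (verb, noun, relation) -> candidate sense pairings.
# All sense labels are invented; they are not WordNet identifiers.
BIGRAM_SENSES = {
    ("bring", "case", "V obj N"): [
        ("bring: initiate (legal action)", "case: lawsuit"),
        ("bring: carry along", "case: container"),
    ],
    ("refer", "letter", "N subj V"): [
        ("letter: alphabetic character", "refer: denote"),
        ("letter: written message", "refer: mention"),
    ],
    # Invented single-sense entry, standing in for the unambiguous majority.
    ("drink", "wine", "V obj N"): [
        ("drink: swallow a liquid", "wine: fermented grape juice"),
    ],
}


def lookup(verb, noun, relation, table=BIGRAM_SENSES):
    """Return (status, candidates) for a collocation.

    status is "resolved" (one pairing), "needs_context" (several pairings),
    or "unknown" (collocation not in the table).
    """
    candidates = table.get((verb, noun, relation), [])
    if len(candidates) == 1:
        return "resolved", candidates
    if candidates:
        return "needs_context", candidates
    return "unknown", []


if __name__ == "__main__":
    print(lookup("drink", "wine", "V obj N"))
    print(lookup("bring", "case", "V obj N"))
```

The "needs_context" status marks the minority of collocations that would be handed to a context-based method, such as overlap with example sentences.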
  • 32. The Century Dictionary ● URL: and ● Data Owner: public domain ● Underlying data: late 19th century English ● Main purpose: “a work of universal reference in all departments of knowledge” ● Possibly useful for: same!