CS 124/LINGUIST 180: From Languages to Information<br />Dan Jurafsky<br />Lecture 14: Information Extraction and Semantic ...
2<br />Background: Information Extraction<br />Extract information from text<br />Sometimes called text analyticscommercia...
3<br />Information Extraction<br />Creating knowledge bases and ontologies<br />Implications for cognitive modeling<br />D...
Outline<br />Reminder: Named Entity Tagging<br />Relation Extraction<br />Hand-built patterns<br />Seed (bootstrap) method...
What is “Information Extraction”<br />As a task:<br />Filling slots in a database from sub-segments of text.<br />October ...
What is “Information Extraction”<br />As a task:<br />Filling slots in a database from sub-segments of text.<br />October ...
What is “Information Extraction”<br />As a family of techniques:<br />Information Extraction =<br />  segmentation + classi...
What is “Information Extraction”<br />As a family of techniques:<br />Information Extraction =<br />  segmentation + classi...
What is “Information Extraction”<br />As a family of techniques:<br />Information Extraction =<br />  segmentation + classi...
What is “Information Extraction”<br />TITLE   ORGANIZATION<br />NAME      <br />Bill Gates<br />CEO<br />Microsoft<br />Bi...
Extracting Structured Knowledge<br />Each article can contain hundreds or thousands of items of knowledge...<br />“The Law...
12<br />Goal:  Machine-readable summaries<br />Structured knowledge extraction:  Summary for machine<br />Textual abstract...
From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />News articles...<br />slide from Rion Snow<br />
From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />Blog posts....<br />slide from Rion Snow<br />
From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />Scientific journal articles...<br />slide from ...
From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />Tweets, instant messages, chat logs...<br />sli...
From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<...
From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<...
From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<...
From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<...
From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<...
From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<...
Reminder: Task 1: Named Entity Tagging<br /><ul><li>General NER or Biomedical NER</li></ul>	<PER> John Hennessy</PER> is a...
Reminder: Maximum Entropy Markov Model<br />DNA<br />O<br />DNA<br />O<br />regulation<br />HIV−1<br />gene<br />of<br />Sl...
Task II: Relation Extraction<br />
Relations between words<br />Language Understanding Applications needs word meaning!<br />Question answering<br />Conversa...
Relation Prediction<br />“...works by such authors as Herrick, Goldsmith, and Shakespeare.”<br />“If you consider authors ...
Hyponymy<br />One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other<br /...
30<br />WordNet relations<br />X is-a-kind-of Y<br />(hyponym / hypernym)<br />X is-a-part-of Y<br />(meronym / holonym)<b...
WordNet is incomplete; ontological relations are missing for many words<br />This is especially true for specific domains ...
Other kinds of Relations: Disease Outbreaks<br />Extract structured information from text<br />Slide from Eugene Agichtein...
More relations: Protein Interactions<br />interact<br />complex<br />CBF-A            CBF-C<br />CBF-B  	          CBF-A-C...
Yet More Relations<br />CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 p...
Relation Types<br />For generic news texts...<br />Slide from Jim Martin<br />
More relations: UMLS<br />Unified Medical Language System<br />integrates linguistic, terminological and semantic informat...
Relations in Ontologies: GO (Gene Ontology)<br />GO (Gene Ontology)<br />Aligns descriptions of gene products in different...
Relations in Ontologies: geographical<br />Ontology<br />F-Logic<br />similar<br />Geographical Entity (GE)<br />is-a<br /...
39<br />MeSH (Medical Subject Headings) Thesaurus<br />Definition<br />MeSH Descriptor<br />Synonym set<br />Slide from Il...
MeSH Tree<br />MeSH Ontology<br />Hierarchically arranged from most general to most specific.<br />Actually a graph rather...
Slide from Doug Appelt<br />Types of ACE Relations, 2003<br />ROLE - relates a person to an organization or a geopolitical...
Frequent Freebase Relations<br />a<br />
Predicting the “is-a” relation<br />“...works by such authors as Herrick, Goldsmith, and Shakespeare.”<br />“If you consid...
Treatment<br />Disease<br />Why this is hard: Ambiguity! Which relations hold between 2 entities?<br />Cure?<br />Prevent?<...
Different relations between Disease (Hepatitis) and Treatment<br />Cure<br />These results suggest that con A-induced hepa...
5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<...
5 easy methods for relation extraction<br />Hand-built patterns<br />Bootstrapping (seed) methods<br />Supervised methods<...
A complex hand-built extraction rule [NYU Proteus]<br />
Goal:   Add hyponyms to WordNet directly from text.<br />Intuition from Hearst (1992) <br />“Agar is a substance prepared ...
Goal:   Add hyponyms to WordNet directly from text.<br />Intuition from Hearst (1992) <br />“Agar is a substance prepared ...
Hearst’s Hand-Designed Lexico-Syntactic Patterns<br />(Hearst, 1992):   Automatic Acquisition of Hyponyms<br />“Y such as ...
Hearst’s hand-built patterns for Relation Extraction<br />
Problem with hand-built patterns<br />Requires that we hand-build patterns for each relation!<br />don’t want to have to d...
5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<...
2. Supervised Relation Extraction<br />Sometimes done in 3 steps<br />Find all pairs of named entities<br />Decide if 2 en...
Relation Analysis<br />Usually just run on named entities within the same sentence<br />Slide from Jim Martin<br />
Slide from Jing Jiang<br />Relation Extraction<br />Task definition: to label the semantic relation between a pair of enti...
Supervised Learning<br />Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006...
ACE 2008 Six Relations<br />
Features: Words<br />Headwords of M1 and M2, and combination<br />George Washington Bridge<br />Bag of words and bigrams i...
Features: Named Entity Type and Mention Level<br />Named-entity types (ORG, LOC, etc)<br />Concatenation of the types<br /...
Features: Parse Tree and Base Phrases<br />Syntactic environment<br />Constituent path through the tree from one to the ot...
Features: Gazetteers and trigger words<br />Personal relative trigger list<br />from WordNet: parent, wife, husband, grandp...
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.<br />
Classifiers for supervised methods<br />Now you can use any classifier you like<br />SVM<br />Logistic regression<br />Naï...
Summary<br />Can get high accuracies with enough hand-labeled training data <br />If test data looks exactly like the trai...
5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<...
Slide from Jim Martin<br />Bootstrapping Approaches<br />What if you don’t have enough annotated text to train on.<br />Bu...
Slide from Jim Martin<br />Bootstrapping Example: Seed Tuple<br /><Mark Twain, Elmira>  Seed tuple<br />Grep (google)<br /...
Hearst (1992) proposal for bootstrapping<br />Choose lexical relation R.<br />Gather a set of pairs that have this relatio...
Slide from Jim Martin<br />Bootstrapping Relations<br />
Dipre (Brin 1998)<br />Extract <author, book> pairs.<br />Start with these 5 seeds<br />Learn these patterns:<br />Now ite...
Snowball (Agichtein and Gravano 2000)<br />New idea: require that X and Y be named entities of particular types<br />{<’s ...
5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<...
Distant supervision paradigm<br />Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. ...
Distant supervision paradigm<br />Has advantages of supervised classification:<br />use of rich of hand-created knowledge<...
77<br />Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrence...
78<br />Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrence...
79<br />Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrence...
80<br />Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrence...
81<br />Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrence...
82<br />Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrence...
How to learn patterns<br />Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17...
One of 70,000 patterns<br />“<superordinate> ‘called’ <subordinate>”<br />Learned from cases such as:<br />“sarcoma / canc...
Hypernym Precision / Recall for all Features<br />Slide from Rion Snow<br />
Hypernym Precision / Recall for all Features<br />Slide from Rion Snow<br />
Hypernym Precision / Recall for all Features<br />Slide from Rion Snow<br />
Hypernym Precision / Recall for all Features<br />Slide from Rion Snow<br />
Idea: use each pattern as a feature!<br />Precision/Recall for Hypernym Classification:<br />logistic regression<br />10-fold...
Outline<br />Reminder: Named Entity Tagging<br />Relation Extraction<br />Hand-built patterns<br />Seed (bootstrap) method...
  1. 1. CS 124/LINGUIST 180: From Languages to Information<br />Dan Jurafsky<br />Lecture 14: Information Extraction and Semantic Relation Learning<br />Lots of slides from many people, including Rion Snow, Jim Martin, Chris Manning, and William Cohen<br />
  2. 2. Background: Information Extraction<br />Extract information from text<br />Sometimes called text analytics commercially<br />Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)<br />Extract the relations between entities<br />Figure out the larger events that are taking place<br />
  3. 3. Information Extraction<br />Creating knowledge bases and ontologies<br />Implications for cognitive modeling<br />Digital Libraries<br />Google Scholar, CiteSeer need to extract the title, author and references<br />Bioinformatics<br />Patent analysis<br />Specific market segments for stock analysis<br />SEC filings<br />Intelligence analysis<br />
  4. 4. Outline<br />Reminder: Named Entity Tagging<br />Relation Extraction<br />Hand-built patterns<br />Seed (bootstrap) methods<br />Supervised classification<br />Distant supervision<br />
  5. 5. What is “Information Extraction”<br />As a task:<br />Filling slots in a database from sub-segments of text.<br />October 14, 2002, 4:00 a.m. PT<br />For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“<br />Richard Stallman, founder of the Free Software Foundation, countered saying…<br />NAME TITLE ORGANIZATION<br />Slide from William Cohen<br />
  6. 6. What is “Information Extraction”<br />As a task:<br />Filling slots in a database from sub-segments of text.<br />October 14, 2002, 4:00 a.m. PT<br />For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."<br />Richard Stallman, founder of the Free Software Foundation, countered saying…<br />IE<br />NAME | TITLE | ORGANIZATION<br />Bill Gates | CEO | Microsoft<br />Bill Veghte | VP | Microsoft<br />Richard Stallman | founder | Free Soft..<br />Slide from William Cohen<br />
  7. 7. What is “Information Extraction”<br />As a family of techniques:<br />Information Extraction =<br /> segmentation + classification + clustering + association<br />October 14, 2002, 4:00 a.m. PT<br />For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."<br />Richard Stallman, founder of the Free Software Foundation, countered saying…<br />Microsoft Corporation<br />CEO<br />Bill Gates<br />Microsoft<br />Gates<br />Microsoft<br />Bill Veghte<br />Microsoft<br />VP<br />Richard Stallman<br />founder<br />Free Software Foundation<br /> “named entity extraction”<br />Slide from William Cohen<br />
  8. 8. What is “Information Extraction”<br />As a family of techniques:<br />Information Extraction =<br /> segmentation + classification + association + clustering<br />October 14, 2002, 4:00 a.m. PT<br />For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."<br />Richard Stallman, founder of the Free Software Foundation, countered saying…<br />Microsoft Corporation<br />CEO<br />Bill Gates<br />Microsoft<br />Gates<br />Microsoft<br />Bill Veghte<br />Microsoft<br />VP<br />Richard Stallman<br />founder<br />Free Software Foundation<br />Slide from William Cohen<br />
  9. 9. What is “Information Extraction”<br />As a family of techniques:<br />Information Extraction =<br /> segmentation + classification + association + clustering<br />October 14, 2002, 4:00 a.m. PT<br />For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."<br />Richard Stallman, founder of the Free Software Foundation, countered saying…<br />Microsoft Corporation<br />CEO<br />Bill Gates<br />Microsoft<br />Gates<br />Microsoft<br />Bill Veghte<br />Microsoft<br />VP<br />Richard Stallman<br />founder<br />Free Software Foundation<br />Slide from William Cohen<br />
  10. 10. What is “Information Extraction”<br />NAME | TITLE | ORGANIZATION<br />Bill Gates | CEO | Microsoft<br />Bill Veghte | VP | Microsoft<br />Richard Stallman | founder | Free Soft..<br />As a family of techniques:<br />Information Extraction =<br /> segmentation + classification + association + clustering<br />October 14, 2002, 4:00 a.m. PT<br />For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."<br />Richard Stallman, founder of the Free Software Foundation, countered saying…<br />Microsoft Corporation<br />CEO<br />Bill Gates<br />Microsoft<br />Gates<br />Microsoft<br />Bill Veghte<br />Microsoft<br />VP<br />Richard Stallman<br />founder<br />Free Software Foundation<br />Slide from William Cohen<br />
  11. 11. Extracting Structured Knowledge<br />Each article can contain hundreds or thousands of items of knowledge...<br />“The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.”<br />LLNL EQ Lawrence Livermore National Laboratory <br />LLNL LOC-IN California<br />Livermore LOC-IN California<br />LLNL IS-A scientific research laboratory<br />LLNL FOUNDED-BY University of California<br />LLNL FOUNDED-IN 1952<br />
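The six facts extracted from the LLNL sentence can be held as subject/relation/object triples. The following is a minimal sketch, not part of the lecture; the `Triple` type and the `query` helper are hypothetical names introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical container for one extracted fact.
Triple = namedtuple("Triple", ["subject", "relation", "obj"])

# The six facts listed on the slide for the LLNL sentence.
triples = [
    Triple("LLNL", "EQ", "Lawrence Livermore National Laboratory"),
    Triple("LLNL", "LOC-IN", "California"),
    Triple("Livermore", "LOC-IN", "California"),
    Triple("LLNL", "IS-A", "scientific research laboratory"),
    Triple("LLNL", "FOUNDED-BY", "University of California"),
    Triple("LLNL", "FOUNDED-IN", "1952"),
]

def query(facts, subject=None, relation=None):
    """Return the facts matching the given subject and/or relation."""
    return [
        t for t in facts
        if (subject is None or t.subject == subject)
        and (relation is None or t.relation == relation)
    ]

# Which entities are located in some place?
located = [t.subject for t in query(triples, relation="LOC-IN")]
```

Once text is reduced to triples like these, a question such as "when was LLNL founded?" becomes a lookup rather than a reading task.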
  12. 12. 12<br />Goal: Machine-readable summaries<br />Structured knowledge extraction: Summary for machine<br />Textual abstract: <br />Summary for human<br />
  13. 13. From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />News articles...<br />slide from Rion Snow<br />
  14. 14. From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />Blog posts....<br />slide from Rion Snow<br />
  15. 15. From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />Scientific journal articles...<br />slide from Rion Snow<br />
  16. 16. From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />Tweets, instant messages, chat logs...<br />slide from Rion Snow<br />
  17. 17. From Unstructured Text to Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  18. 18. From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  19. 19. From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  20. 20. From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  21. 21. From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  22. 22. From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  23. 23. From Unstructured Text to Structured Knowledge<br />Structured Knowledge<br />Unstructured Text<br />slide from Rion Snow<br />
  24. 24. Reminder: Task 1: Named Entity Tagging<br /><ul><li>General NER or Biomedical NER</li></ul> <PER> John Hennessy </PER> is a professor at <ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.<br /><RNA> TAR </RNA> independent transactivation by <PROTEIN> Tat </PROTEIN> in cells derived from the <CELL> CNS </CELL> - a novel mechanism of <DNA> HIV-1 gene </DNA> regulation.<br />Slide from Chris Manning<br />
  25. 25. Reminder: Maximum Entropy Markov Model<br />DNA<br />O<br />DNA<br />O<br />regulation<br />HIV−1<br />gene<br />of<br />Slide from Chris Manning<br />
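Tagger output in the inline-tag notation above can be turned into (mention, type) pairs, which is the input relation extraction works from. This is a small sketch under the assumption that tags are well formed and never nested; `extract_entities` is a hypothetical helper, not something defined in the lecture.

```python
import re

# Matches <TYPE> mention </TYPE>, tolerating spaces just inside the tags.
TAG_RE = re.compile(r"<([A-Z]+)>\s*(.*?)\s*</\1>")

def extract_entities(tagged):
    """Return (mention, entity_type) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in TAG_RE.finditer(tagged)]

sentence = (
    "<PER> John Hennessy</PER> is a professor at "
    "<ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>."
)
entities = extract_entities(sentence)
```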
  26. 26. Task II: Relation Extraction<br />
  27. 27. Relations between words<br />Language Understanding Applications need word meaning!<br />Question answering<br />Conversational agents<br />Summarization<br />One key meaning component: word relations<br />Hierarchical (ontological) relations<br />“San Francisco” ISA “city”<br />Other relations between words <br />“alternator” is a part of a “car”<br />
  28. 28. Relation Prediction<br />“...works by such authors as Herrick, Goldsmith, and Shakespeare.”<br />“If you consider authors like Shakespeare...”<br />“Some authors (including Shakespeare)...”<br />“Shakespeare was the author of several...”<br />“Shakespeare, author of The Tempest...”<br />Shakespeare IS-A author (0.87)<br />How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?<br />
  29. 29. Hyponymy<br />One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other<br />car is a hyponym of vehicle<br />dog is a hyponym of animal<br />mango is a hyponym of fruit<br />Conversely<br />vehicle is a hypernym/superordinate of car<br />animal is a hypernym of dog<br />fruit is a hypernym of mango<br />
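Hyponymy is transitive, so a single table of direct hypernym links is enough to answer "is X a kind of Y?". The sketch below uses the three pairs from the slide; the `poodle` entry is an added, hypothetical example so that the chain has more than one step, and the function names are illustrative.

```python
# Direct hypernym links. car/dog/mango come from the slide;
# "poodle" is an assumed extra entry for illustration.
HYPERNYM_OF = {
    "car": "vehicle",
    "dog": "animal",
    "mango": "fruit",
    "poodle": "dog",
}

def hypernyms(word):
    """Follow direct links upward and return every hypernym of `word`."""
    chain = []
    seen = {word}
    while word in HYPERNYM_OF and HYPERNYM_OF[word] not in seen:
        word = HYPERNYM_OF[word]
        chain.append(word)
        seen.add(word)  # guards against accidental cycles in the table
    return chain

def is_hyponym(specific, general):
    """True if `specific` denotes a subclass of `general`."""
    return general in hypernyms(specific)
```

The relation is directional: `is_hyponym("dog", "animal")` holds, while `is_hyponym("animal", "dog")` does not.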
  30. 30. 30<br />WordNet relations<br />X is-a-kind-of Y<br />(hyponym / hypernym)<br />X is-a-part-of Y<br />(meronym / holonym)<br />slide from Rion Snow<br />
  31. 31. WordNet is incomplete; ontological relations are missing for many words<br />This is especially true for specific domains (restaurants, auto parts, finance)<br />
  32. 32. Other kinds of Relations: Disease Outbreaks<br />Extract structured information from text<br />Slide from Eugene Agichtein<br />May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis… <br />Disease Outbreaks in The New York Times<br />Information Extraction System (e.g., NYU’s Proteus)<br />
  33. 33. More relations: Protein Interactions<br />interact<br />complex<br />CBF-A CBF-C<br />CBF-B CBF-A-CBF-C complex<br />associates<br />„We show that CBF-A and CBF-C interact <br />with each other to form a CBF-A-CBF-C complex<br />and that CBF-B does not interact with CBF-A or <br />CBF-C individually but that it associates with the <br />CBF-A-CBF-C complex.“<br />Slide from Rosario and Hearst<br />
  34. 34. Yet More Relations<br />CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York<br />Slide from Jim Martin<br />
  35. 35. Relation Types<br />For generic news texts...<br />Slide from Jim Martin<br />
  36. 36. More relations: UMLS<br />Unified Medical Language System<br />integrates linguistic, terminological and semantic information<br />Semantic Network consists of 134 semantic types and 54 relations between types<br />Pharmacologic Substance affects Pathologic Function<br />Pharmacologic Substance causes Pathologic Function<br />Pharmacologic Substance complicates Pathologic Function<br />Pharmacologic Substance diagnoses Pathologic Function<br />Pharmacologic Substance prevents Pathologic Function<br />Pharmacologic Substance treats Pathologic Function<br />Slide from Paul Buitelaar<br />
  37. 37. Relations in Ontologies: GO (Gene Ontology)<br />GO (Gene Ontology)<br />Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes<br />Organizing principles are molecular function, biological process and cellular component<br />Accession: GO:0009292<br />Ontology: biological process<br />Synonyms: broad: genetic exchange<br />Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.<br />Term Lineage all : all (164142)<br /> GO:0008150 : biological process (115947)<br /> GO:0007275 : development (11892)<br /> GO:0009292 : genetic transfer (69)<br />Slide from Paul Buitelaar<br />
  38. 38. Relations in Ontologies: geographical<br />Ontology<br />F-Logic<br />similar<br />Geographical Entity (GE)<br />is-a<br />flow_through<br />Inhabited GE<br />Natural GE<br />capital_of<br />city<br />river<br />mountain<br />country<br />instance_of<br />located_in<br />Neckar<br />Zugspitze<br />Germany<br />capital_of<br />length (km)<br />height (m)<br />flow_through<br />located_in<br />Berlin<br />Stuttgart<br />367<br />2962<br />flow_through<br />Design: Philipp Cimiano<br />Slide from Paul Buitelaar<br />
  39. 39. MeSH (Medical Subject Headings) Thesaurus<br />Definition<br />MeSH Descriptor<br />Synonym set<br />Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song<br />
  40. 40. MeSH Tree<br />MeSH Ontology<br />Hierarchically arranged from most general to most specific.<br />Actually a graph rather than a tree: descriptors normally appear in more than one place in the tree<br />MeSH Tree<br />Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song<br />
  41. 41. Slide from Doug Appelt<br />Types of ACE Relations, 2003<br />ROLE - relates a person to an organization or a geopolitical entity<br />Subtypes: member, owner, affiliate, client, citizen<br />PART - generalized containment<br />Subtypes: subsidiary, physical part-of, set membership<br />AT - permanent and transient locations<br />Subtypes: located, based-in, residence<br />SOCIAL - social relations among persons<br />Subtypes: parent, sibling, spouse, grandparent, associate<br />
  42. 42. Frequent Freebase Relations<br />
  43. 43. Predicting the “is-a” relation<br />“...works by such authors as Herrick, Goldsmith, and Shakespeare.”<br />“If you consider authors like Shakespeare...”<br />“Some authors (including Shakespeare)...”<br />“Shakespeare was the author of several...”<br />“Shakespeare, author of The Tempest...”<br />Shakespeare IS-A author (0.87)<br />How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?<br />
  44. 44. Treatment<br />Disease<br />Why this is hard: Ambiguity! Which relations hold between 2 entities?<br />Cure?<br />Prevent?<br />Side Effect?<br />
  45. 45. Different relations between Disease (Hepatitis) and Treatment<br />Cure<br />These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.<br />Prevent<br />A two-dose combined hepatitis A and B vaccine would facilitate immunization programs<br />Vague<br />Effect of interferon on hepatitis B<br />Slide from Barbara Rosario and Marti Hearst<br />
  46. 46. 5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<br />Unsupervised methods<br />Distant supervision<br />
  47. 47. 5 easy methods for relation extraction<br />Hand-built patterns<br />Bootstrapping (seed) methods<br />Supervised methods<br />Unsupervised methods<br />Distant supervision<br />
  48. 48. A complex hand-built extraction rule [NYU Proteus]<br />
  49. 49. Goal: Add hyponyms to WordNet directly from text.<br />Intuition from Hearst (1992) <br />“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”<br />What does Gelidium mean? <br />How do you know?<br />
  51. 51. Hearst’s Hand-Designed Lexico-Syntactic Patterns<br />(Hearst, 1992): Automatic Acquisition of Hyponyms<br />“Y such as X ((, X)* (, and/or) X)”<br />“such Y as X…”<br />“X… or other Y”<br />“X… and other Y”<br />“Y including X…”<br />“Y, especially X…”<br />
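The patterns above can be approximated with regular expressions. This is a rough sketch rather than Hearst's implementation: it assumes every noun phrase is a single word (a real system would run a noun-phrase chunker first), it does no lemmatization, and the names `HEARST_PATTERNS` and `extract_hyponyms` are introduced here only for illustration.

```python
import re

# Simplifying assumption: a noun phrase is one word.
NP = r"[A-Za-z][\w-]*"
# Either a coordinated list ("A, B, and C" / "A or B") or a single NP.
NP_LIST = rf"(?:(?:{NP}, )*{NP},? (?:and|or) {NP}|{NP})"

HEARST_PATTERNS = [
    # "Y such as X ((, X)* (, and/or) X)"
    re.compile(rf"\b(?P<hyper>{NP}),? such as (?P<hypos>{NP_LIST})"),
    # "such Y as X..."
    re.compile(rf"\bsuch (?P<hyper>{NP}) as (?P<hypos>{NP_LIST})"),
    # "X... and other Y" / "X... or other Y"
    re.compile(rf"\b(?P<hypos>{NP}(?:, {NP})*),? (?:and|or) other (?P<hyper>{NP})"),
    # "Y including X..."
    re.compile(rf"\b(?P<hyper>{NP}),? including (?P<hypos>{NP_LIST})"),
    # "Y, especially X..."
    re.compile(rf"\b(?P<hyper>{NP}),? especially (?P<hypos>{NP_LIST})"),
]

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by any pattern."""
    pairs = []
    for pattern in HEARST_PATTERNS:
        for m in pattern.finditer(text):
            # Split the matched list on commas and on and/or conjunctions.
            for hypo in re.split(r",? (?:and|or) |, ", m.group("hypos")):
                pairs.append((hypo, m.group("hyper")))
    return pairs

agar = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use")
authors = "works by such authors as Herrick, Goldsmith, and Shakespeare."
```

On the agar sentence this yields the pair (Gelidium, algae), which is the inference a human reader makes without knowing what Gelidium is. Because of the one-word assumption, the hypernym comes out as "algae" rather than "red algae".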
  52. 52. Hearst’s hand-built patterns for Relation Extraction<br />
  53. 53. Problem with hand-built patterns<br />Requires that we hand-build patterns for each relation!<br />don’t want to have to do this for all possible relations!<br />we’d like better accuracy<br />
  54. 54. 5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<br />Unsupervised methods<br />Distant supervision<br />
  55. 55. 2. Supervised Relation Extraction<br />Sometimes done in 3 steps<br />Find all pairs of named entities<br />Decide if 2 entities are related<br />If yes, classify the relation<br />Why the extra step?<br />Cuts down on training time for classification by eliminating most pairs<br />Allows separate feature sets that are appropriate for each task<br />
  56. 56. Relation Analysis<br />Usually just run on named entities within the same sentence<br />Slide from Jim Martin<br />
  57. 57. Slide from Jing Jiang<br />Relation Extraction<br />Task definition: to label the semantic relation between a pair of entities in a sentence (fragment)<br />…[leader]arg-1 of a minority [government]arg-2…<br />PHYS<br />PER-SOC<br />EMP-ORG<br />NIL<br />PHYS: Physical<br />PER-SOC: Personal / Social<br />EMP-ORG: Employment / Membership / Subsidiary<br />
  58. 58. Supervised Learning<br />Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita 2007])<br />Training data is needed for each relation type<br />…[leaderarg-1] of a minority [governmentarg-2]…<br />arg-1 word: leader<br />arg-2 type: ORG<br />dependency:<br />arg-1  of  arg-2<br />EMP-ORG<br />PHYS<br />PER-SOC<br />NIL<br />Slide from Jing Jiang<br />
  59. 59. ACE 2008 Six Relations<br />
  60. 60. Features: Words<br />Headwords of M1 and M2, and combination<br />George Washington Bridge<br />Bag of words and bigrams in M1 and M2<br />Words or bigrams in particular positions to the left and right of M1 and M2<br />+/- 1, 2, 3<br />Bag of words or bigrams between the two entities<br />
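A rough sketch of how these word features might be computed for one mention pair. It assumes pre-tokenized input and (start, end) token spans for M1 and M2, and it approximates the headword by the last token of the mention (so "Bridge" for "George Washington Bridge"). Only the +/-1 positions and unigrams are shown, and the feature names are made up for illustration.

```python
def word_features(tokens, m1, m2):
    """Word-based features for a mention pair.

    m1 and m2 are (start, end) token spans, end exclusive, with m1 before m2.
    The headword is approximated by the last token of each mention.
    """
    (s1, e1), (s2, e2) = m1, m2

    def at(i):
        # Token at position i, or a padding symbol outside the sentence.
        return tokens[i] if 0 <= i < len(tokens) else "<PAD>"

    head1, head2 = tokens[e1 - 1], tokens[e2 - 1]
    return {
        "head_m1": head1,
        "head_m2": head2,
        "head_pair": head1 + "+" + head2,
        "bow_m1": set(tokens[s1:e1]),
        "bow_m2": set(tokens[s2:e2]),
        "m1_left1": at(s1 - 1),
        "m1_right1": at(e1),
        "m2_left1": at(s2 - 1),
        "m2_right1": at(e2),
        "between_bow": set(tokens[e1:s2]),
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
feats = word_features(tokens, (0, 2), (14, 16))
print(feats["head_pair"], feats["m2_left1"], feats["m2_right1"])
```

In a real feature extractor each entry would be turned into one or more binary indicator features before being passed to the classifier.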
  61. 61. Features: Named Entity Type and Mention Level<br />Named-entity types (ORG, LOC, etc)<br />Concatenation of the types<br />Entity Level of M1 and M2 <br />(NAME, NOMINAL, PRONOUN)<br />
  62. 62. Features: Parse Tree and Base Phrases<br />Syntactic environment<br />Constituent path through the tree from one to the other<br />Base syntactic chunk sequence from one to the other<br />Dependency path<br />Slide from Jim Martin<br />
  63. 63. Features: Gazetteers and trigger words<br />Personal relative trigger list<br />from WordNet: parent, wife, husband, grandparent, etc<br />Country name list<br />
  64. 64. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.<br />
  65. 65. Classifiers for supervised methods<br />Now you can use any classifier you like<br />SVM<br />Logistic regression<br />Naïve Bayes<br />etc<br />
  66. 66. Summary<br />Can get high accuracies with enough hand-labeled training data <br />If test data looks exactly like the training data<br />But<br />labeling 5000 relations (and named entities) is expensive<br />the approach doesn’t generalize to different genres<br />
  67. 67. 5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<br />Unsupervised methods<br />Distant supervision<br />
  68. 68. Slide from Jim Martin<br />Bootstrapping Approaches<br />What if you don’t have enough annotated text to train on?<br />But you might have some seed tuples <br />Or you might have some patterns that work pretty well<br />Can you use those seeds to do something useful?<br />Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers...<br />Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds<br />
  69. 69. Slide from Jim Martin<br />Bootstrapping Example: Seed Tuple<br /><Mark Twain, Elmira> Seed tuple<br />Grep (google)<br />“Mark Twain is buried in Elmira, NY.”<br />X is buried in Y<br />“The grave of Mark Twain is in Elmira”<br />The grave of X is in Y<br />“Elmira is Mark Twain’s final resting place”<br />Y is X’s final resting place.<br />Use those patterns to grep for new tuples that you don’t already know<br />
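One round of this seed-to-patterns-to-tuples loop could look roughly like the sketch below. It makes several simplifying assumptions: the "corpus" is a hand-written list of sentences rather than web search results, a pattern is a whole sentence with the seed arguments replaced by slots, and an argument is any run of capitalized words. The function names are hypothetical.

```python
import re

# Crude stand-in for a named entity: one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def induce_patterns(seed, sentences):
    """Turn every sentence containing both seed arguments into a regex."""
    x, y = seed
    patterns = []
    for s in sentences:
        if x in s and y in s:
            regex = re.escape(s)
            regex = regex.replace(re.escape(x), f"(?P<X>{NAME})", 1)
            regex = regex.replace(re.escape(y), f"(?P<Y>{NAME})", 1)
            patterns.append(re.compile(regex))
    return patterns

def apply_patterns(patterns, sentences, known):
    """Match the patterns against the corpus and return tuples not yet known."""
    found = set()
    for s in sentences:
        for p in patterns:
            m = p.fullmatch(s)
            if m and (m.group("X"), m.group("Y")) not in known:
                found.add((m.group("X"), m.group("Y")))
    return found

seed = ("Mark Twain", "Elmira")
corpus = [
    "Mark Twain is buried in Elmira.",
    "The grave of Mark Twain is in Elmira.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of Jim Morrison is in Paris.",
]
patterns = induce_patterns(seed, corpus)
new_tuples = apply_patterns(patterns, corpus, known={seed})
print(sorted(new_tuples))
```

Feeding `new_tuples` back in as seeds gives the iterative version; without some confidence scoring, errors from a bad pattern would compound on each round.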
  70. 70. Hearst (1992) proposal for bootstrapping<br />Choose lexical relation R.<br />Gather a set of pairs that have this relation<br />Find places in the corpus where these expressions occur near each other and record the environment.<br />Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest.<br />
  71. 71. Slide from Jim Martin<br />Bootstrapping Relations<br />
  72. 72. DIPRE (Brin 1998)<br />Extract <author, book> pairs.<br />Start with these 5 seeds<br />Learn these patterns:<br />Now iterate, using these patterns to get more instances and patterns…<br />
  73. 73. Snowball (Agichtein and Gravano 2000)<br />New idea: require that X and Y be named entities of particular types<br />ORGANIZATION {<’s 0.7> <in 0.7> <headquarters 0.7>} LOCATION<br />LOCATION {<- 0.75> <based 0.75>} ORGANIZATION<br />
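One plausible way to use such typed, weighted patterns is to score a candidate context by the overlap of its middle terms with the pattern, and to reject it outright when the entity types differ. The sketch below is loosely modeled on that idea and is not Snowball's exact scoring function: it ignores left and right contexts, and the context weights are invented for illustration.

```python
def match_score(pattern, context):
    """Score a candidate context against a typed, weighted pattern.

    Both arguments are (left_type, middle_weights, right_type) triples,
    where middle_weights maps the terms between the two entities to weights.
    Returns 0.0 unless the entity types agree.
    """
    p_left, p_mid, p_right = pattern
    c_left, c_mid, c_right = context
    if (p_left, p_right) != (c_left, c_right):
        return 0.0
    # Dot product over the terms the pattern and the context share.
    return sum(w * c_mid.get(term, 0.0) for term, w in p_mid.items())

hq_pattern = ("ORGANIZATION",
              {"'s": 0.7, "in": 0.7, "headquarters": 0.7},
              "LOCATION")
# Hypothetical contexts with made-up weights.
good = ("ORGANIZATION", {"'s": 0.6, "headquarters": 0.6, "in": 0.6}, "LOCATION")
wrong_types = ("PERSON", {"'s": 0.6, "headquarters": 0.6, "in": 0.6}, "LOCATION")

print(match_score(hq_pattern, good), match_score(hq_pattern, wrong_types))
```

The type check is what separates this from purely lexical patterns: the same words between a PERSON and a LOCATION contribute nothing.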
  74. 74. 5 easy methods for relation extraction<br />Hand-built patterns<br />Supervised methods<br />Bootstrapping (seed) methods<br />Unsupervised methods<br />Distant supervision<br />
  75. 75. Distant supervision paradigm<br />Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17<br />Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relation extraction without labeled data. ACL-2009.<br />Instead of hand-creating 5 seed examples<br />Use a large database to get our seed examples<br />lots of examples<br />supervision from a database, not a corpus!<br />Not genre-dependent!<br />Create lots and lots of noisy features from all these examples<br />Combine in a classifier<br />
  76. 76. Distant supervision paradigm<br />Has advantages of supervised classification:<br />use of rich hand-created knowledge<br />Has advantages of unsupervised classification:<br />infinite amounts of data<br />allows for very large number of weak features<br />not sensitive to training corpus<br />
  77. 77. Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.<br />Slide from Rion Snow<br />
  82. 82. Relation Classification with “Distant Supervision”<br />We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.<br />This leads to high-signal examples like:<br />“...consider authors like Shakespeare...”<br />“Some authors (including Shakespeare)...”<br />“Shakespeare was the author of several...”<br />“Shakespeare, author of The Tempest...”<br />But noisy examples like:<br />“The author of Shakespeare in Love...”<br />“...authors at the Shakespeare Festival...”<br />Training set (TREC and Wikipedia):<br />14,000 hypernym pairs, ~600,000 total pairs<br />Slide from Rion Snow<br />
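The labeling step can be sketched as follows. This is an illustration under assumptions, not the pipeline from the papers: `kb` is a tiny hand-written set standing in for WordNet IS-A pairs, candidate pairs are supplied by hand rather than collected from the corpus, and `norm` is a crude substitute for lemmatization.

```python
import re

def norm(word):
    """Lowercase and strip a plural 's' (a crude stand-in for lemmatization)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

def label_occurrences(sentences, candidate_pairs, kb_pairs):
    """Label each (sentence, pair) co-occurrence by knowledge-base membership."""
    examples = []
    for sentence in sentences:
        words = {norm(w) for w in re.findall(r"[A-Za-z]+", sentence)}
        for hypo, hyper in candidate_pairs:
            if norm(hypo) in words and norm(hyper) in words:
                label = (hypo, hyper) in kb_pairs
                examples.append((sentence, (hypo, hyper), label))
    return examples

kb = {("Shakespeare", "author")}   # hypothetical stand-in for WordNet IS-A pairs
pairs = [("Shakespeare", "author"), ("festival", "author")]
sentences = [
    "Some authors (including Shakespeare) wrote for the stage.",
    "The author of Shakespeare in Love gave a talk.",
    "Many authors spoke at the festival.",
]
for sentence, pair, label in label_occurrences(sentences, pairs, kb):
    print(label, pair, sentence)
```

The second sentence is labeled positive even though it does not express the IS-A relation, which is the kind of noise the slide warns about. The label comes from the database alone and never from reading the sentence.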
  83. 83. How to learn patterns<br />Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17<br />… doubly heavy hydrogen atom called deuterium…<br />1. Take corpus sentences<br />2. Collect noun pairs, e.g. (atom, deuterium)<br />752,311 pairs from 6M words of newswire<br />3. Is pair an IS-A in WordNet? (here: YES)<br />14,387 yes, 737,924 no<br />4. Parse the sentences<br />5. Extract patterns<br />69,592 dependency paths (>5 pairs)<br />6. Train classifier on these patterns<br />Logistic regression with 70K features (actually converted to 974,288 bucketed binary features)<br />
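The final step, training a classifier whose features are patterns, can be illustrated with a tiny logistic regression trained by stochastic gradient descent. Everything here is a toy assumption: the four training pairs and their pattern counts are invented, surface strings such as "X called Y" stand in for dependency paths, and raw counts are used directly instead of the bucketed binary features mentioned on the slide.

```python
import math
from collections import Counter

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, feats):
    """Probability that a noun pair with these pattern counts is an IS-A pair."""
    return sigmoid(b + sum(w[f] * v for f, v in feats.items()))

def train_logreg(examples, epochs=200, lr=0.5):
    """examples: list of (pattern-count dict, label) with label 0 or 1."""
    w, b = Counter(), 0.0
    for _ in range(epochs):
        for feats, y in examples:
            g = predict(w, b, feats) - y      # gradient of the log loss
            b -= lr * g
            for f, v in feats.items():
                w[f] -= lr * g * v
    return w, b

# One entry per noun pair: counts of the patterns linking the two nouns,
# plus the (noisy) label taken from the knowledge base.
training = [
    ({"X called Y": 2, "X such as Y": 1}, 1),
    ({"X called Y": 1}, 1),
    ({"X of Y": 1}, 0),
    ({"X of Y": 2, "X at Y": 1}, 0),
]
w, b = train_logreg(training)
print(predict(w, b, {"X called Y": 1}), predict(w, b, {"X of Y": 1}))
```

After training, patterns that co-occur mostly with known IS-A pairs end up with positive weights, so the classifier can score noun pairs that are absent from the knowledge base.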
  84. 84. One of 70,000 patterns<br />“<superordinate> ‘called’ <subordinate>”<br />Learned from cases such as:<br />“sarcoma / cancer”: …an uncommon bone cancer called osteogenic sarcoma and to…<br />“deuterium / atom”: …heavy water rich in the doubly heavy hydrogen atom called deuterium.<br />New pairs discovered: <br />“efflorescence / condition”: …and a condition called efflorescence are other reasons for… <br />“’neal_inc / company” …The company, now called O'Neal Inc., was sole distributor of E-Ferol…<br />“hat_creek_outfit / ranch” …run a small ranch called the Hat Creek Outfit.<br />“hiv-1 / aids_virus” …infected by the AIDS virus, called HIV-1.<br />“bateau_mouche / attraction” …local sightseeing attraction called the Bateau Mouche...<br />“kibbutz_malkiyya / collective_farm” …an Israeli collective farm called Kibbutz Malkiyya…<br />
  85. 85. Hypernym Precision / Recall for all Features<br />Slide from Rion Snow<br />
  89. 89. Idea: use each pattern as a feature!<br />Precision/Recall for Hypernym Classification:<br />logistic regression<br />10-fold Cross Validation on 14,000 WordNet-Labeled Pairs<br />slide from Rion Snow<br />
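For reference, precision and recall over a set of predicted hypernym pairs can be computed as below. The predicted and gold sets are made-up toy data, not results from the lecture.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                        # correctly predicted pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data: (hyponym, hypernym) pairs.
gold = {("deuterium", "atom"), ("sarcoma", "cancer"), ("Shakespeare", "author")}
predicted = {("deuterium", "atom"), ("sarcoma", "cancer"),
             ("festival", "author"), ("love", "author")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

Sweeping the classifier's decision threshold and recomputing these values at each setting yields a precision/recall curve.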
  90. 90. Outline<br />Reminder: Named Entity Tagging<br />Relation Extraction<br />Hand-built patterns<br />Seed (bootstrap) methods<br />Supervised classification<br />Distant supervision<br />
