Tim Finin: “From Strings to Things: Populating Knowledge Bases from Text”


Tim Finin, Professor of Computer Science and Electrical Engineering at the University of Maryland, Baltimore County (UMBC), gave a wonderful presentation entitled “From Strings to Things: Populating Knowledge Bases from Text” as part of our Cognitive Systems Institute Speaker Series.



  1. From Strings to Things: Populating Knowledge Bases from Text
     Tim Finin, University of Maryland, Baltimore County
     IBM Cognitive Systems Institute Speaker Series, July 14, 2016
     http://ebiq.org/r/369
  2. The Web is our greatest knowledge source
  3. But it was designed for people, not machines
  4. But it was designed for people, not machines
     • Its content is mostly text, spoken language, images and videos
     • These are easy for people to understand
     • But hard for machines
     Machines need access to this knowledge too
  5. We need to add knowledge graphs
  6. We need to add knowledge graphs
     • High-quality semi-structured information about entities and relations
     • Represented and accessed via Web standards
     • Easily integrated, fused and reasoned with
  7. Who wrote Tom Sawyer?
  8. Almost all commercial recipe sites embed semantic data about the recipes in an RDF-compatible form using terms from the schema.org ontology. Search engines read and use this data to better understand the semantics of the page content.
  9. Where does knowledge come from?
     • Existing knowledge graphs were quickly developed with semi-structured data from Wikipedia, augmented with
       – Structured databases (GeoNames)
       – Semi-structured data collections (CIA Factbook)
     • Maintaining and extending them and creating focused knowledge graphs will increasingly require extracting information from text
  10. NIST Text Analysis Conference
      Yearly evaluation workshops on natural language processing and related applications, with large test collections and common evaluation procedures
      TAC tracks, 2008-2015 (number of years each track was held):
      • Question Answering: 1
      • Textual Entailment: 4
      • Summarization: 5
      • Knowledge Base Population: 7
      • Entity Discovery & Linking: 2
      • Event Detection: 2
      • Validation: 1
      http://www.nist.gov/tac
  11. Knowledge Base Population
      Tracks focused on building a knowledge base from entities and relations extracted from text
      • Cold Start KBP: construct a KB from text
      • Entity discovery & linking: cluster and link entity mentions
      • Slot filling
      • Slot filler validation
      • Sentiment
      • Events: discover and cluster events in text
  12. Knowledge Base Population*
      Build a system that can
      • Read 90K documents: newswire articles & social media posts in English, Chinese and Spanish
      • Find entity mentions, types and relations
      • Cluster entities within and across documents and link to a reference KB when appropriate
      • Remove errors (Obama born in Illinois), draw sound inferences (Malia and Sasha sisters)
      • Generate triples with provenance data (Doc + string offsets) for mentions and relations
      * 2016
  13. Knowledge Base Population* (task as on slide 12; sample input document)
      * 2016
      <DOC id="APW_ENG_20100325.0021" type="story" >
      <HEADLINE> Divorce attorney says Dennis Hopper is dying </HEADLINE>
      <DATELINE> LOS ANGELES 2010-03-25 00:15:51 UTC </DATELINE>
      <TEXT>
      <P> Dennis Hopper's divorce attorney says in a court filing that the actor is dying and can't undergo chemotherapy as he battles prostate cancer. </P>
      <P> Attorney Joseph Mannis described the "Easy Rider" star's grave condition in a declaration filed Wednesday in Los Angeles Superior Court. </P>
      <P> Mannis and attorneys for Hopper's wife Victoria are fighting over when and whether to take the actor's deposition. </P>
      …
  14. Knowledge Base Population* (task and sample document as on slides 12-13; sample output triples)
      * 2016
      …
      :e00211 type PER
      :e00211 link FB:m.02fn5
      :e00211 link WIKI:Dennis_Hopper
      :e00211 mention "Dennis Hopper" APW_021:185-197
      :e00211 mention "Hopper" APW_021:507-512
      :e00211 mention "Hopper" APW_021:618-623
      :e00211 mention "丹尼斯·霍珀" CMN_011:930-936
      :e00211 per:spouse :e00217 APW_021:521-528
      :e00217 per:spouse :e00211 APW_021:521-528
      :e00211 per:age "72" APW_021:521-528
      …
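The triple serialization shown on slide 14 can be read with a few lines of code. The following is a minimal Python sketch, assuming whitespace-separated fields, optional double-quoted string values, and an optional trailing `DOC:start-end` provenance field. The `Triple` class and `parse_triple` function are hypothetical names for illustration and are not part of any TAC or Kelvin tool.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    doc: Optional[str]    # provenance document id, if present
    start: Optional[int]  # provenance start offset, if present
    end: Optional[int]    # provenance end offset, if present


# A field is either a double-quoted string or a run of non-space characters.
FIELD = re.compile(r'"([^"]*)"|(\S+)')
# Assumed provenance shape: document id, colon, start-end character offsets.
PROVENANCE = re.compile(r'^(\w+):(\d+)-(\d+)$')


def parse_triple(line: str) -> Triple:
    """Parse one line of the (assumed) triple format into a Triple."""
    fields = [m.group(1) if m.group(1) is not None else m.group(2)
              for m in FIELD.finditer(line)]
    if len(fields) < 3:
        raise ValueError(f"not a triple: {line!r}")
    subject, predicate, obj = fields[:3]
    doc = start = end = None
    if len(fields) > 3:
        m = PROVENANCE.match(fields[3])
        if m:
            doc, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    return Triple(subject, predicate, obj, doc, start, end)


mention = parse_triple(':e00211 mention "Dennis Hopper" APW_021:185-197')
entity_type = parse_triple(':e00211 type PER')
```

Keeping the provenance fields separate from the object makes it easy to check an extracted fact against the source text later, which is what the assessors on the evaluation slides rely on.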
  16. KB Evaluation Methodology
      • Evaluating KBs extracted from 90K documents is non-trivial
      • TAC's approach is simplified by:
        – Fixing the ontology of entity types and relations
        – Specifying a serialization as triples + provenance
        – Sampling a KB using a set of queries grounded in an entity mention found in a document
      • Given a KB, we can evaluate its precision and recall for a set of queries
  17. KB Evaluation Methodology
      • Example: What are the names of schools attended by the children of the entity mentioned in document #45611 at characters 401-412?
        – That mention is George Bush and the document context suggests it refers to the 41st U.S. president
        – Query given in structured form using TAC ontology
      • Assessors determine good answers given text corpus and check submission results using provenance as needed
        – Yale, Harvard, UT Austin, Tulane, Univ. of Virginia, Boston College, ...
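Precision and recall for a single query, as described on slides 16 and 17, can be sketched as a set comparison. This is a simplified illustration only: the actual TAC scoring involves human assessors and provenance checks. The function name `precision_recall` is hypothetical, and "MIT" is an invented wrong answer used to show a precision error.

```python
def precision_recall(system_answers, gold_answers):
    """Set-based precision and recall for one query."""
    system, gold = set(system_answers), set(gold_answers)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Gold answers are a subset of the schools listed on the slide.
gold = {"Yale", "Harvard", "UT Austin", "Tulane"}
# Hypothetical system output: two correct answers and one wrong one.
system = {"Yale", "Harvard", "MIT"}
precision, recall = precision_recall(system, gold)
```

Here two of the three returned answers are correct and two of the four gold answers are found, so precision is 2/3 and recall is 1/2.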
  18. TAC Ontology
      • Five basic entity types
        – PER: people (John Lennon) or groups (Americans)
        – ORG: organizations like IBM, MIT or US Senate
        – GPE: geopolitical entity like Boston, Belgium or Europe
        – LOC: locations like Central Park or the Rockies
        – FAC: facilities like BWI or the Empire State Building
      • Entity Mentions
        – Strings referencing entities by name (Barack Obama) or description (the President)
      • ~65 relations
        – Relations hold between two entities: parent_of, spouse, employer, founded_by, city_of_birth, …
        – Or between an entity & string: age, website, title, cause_of_death, ...
  19. Kelvin
      • KELVIN: Knowledge Extraction, Linking, Validation and Inference
      • Information extraction system developed at the Human Language Technology Center of Excellence at JHU
      • Used in the TAC KBP tracks in 2012-2016 and various projects
      • Documents in, triples out (with provenance)
      http://ebiq.org/p/730
  20. Kelvin's basic pipeline
      1. Document-level analysis for entity & relation extraction
      2. Cross-document co-reference to create initial KB
      3. KB linking, inference and adjudication
      4. Materialize the KB
  21. Document-level processing (pipeline step 1)
      • Done in parallel on a grid-based system
      • Extract named entities, mentions, events, relations, values using several tools (BBN Serif, Stanford CoreNLP, custom systems)
      • Map results to TAC ontology
      • Resolve inconsistencies in entity mention chains
      • Produces “single document” KBs with TAC-compliant triples and provenance data
      • CS15: 50K docs, 1.9M mentions, 800K entities, 2M facts
  22. X-doc Coref: creating initial KB (pipeline step 2)
      • Cross-document co-reference creates an initial KB from the collection of single-document KBs
        – Identify that Barack Obama entity in DOC32 is same individual as Obama in DOC342, etc.
      • Uses agglomerative clustering on similarity of entity mentions & co-mentioned entities
        – Both docs also mention White House, Biden, D.C.
        – Various string similarity measures, name frequency
      • Works well on English, Spanish, Chinese
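The clustering idea on slide 22 can be illustrated with a small sketch. This is not Kelvin's or Kripke's actual algorithm: it assumes a similarity that averages a name string similarity (Python's `difflib.SequenceMatcher`) with the Jaccard overlap of co-mentioned entities, single-link merging, and an arbitrary threshold of 0.5. The `DOC7` entity and its context are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average of name string similarity and co-mention overlap (assumed weights)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    union = a["context"] | b["context"]
    context_sim = len(a["context"] & b["context"]) / len(union) if union else 0.0
    return 0.5 * name_sim + 0.5 * context_sim


def cluster(entities, threshold=0.5):
    """Greedy single-link agglomerative clustering of document-level entities."""
    clusters = [[e] for e in entities]
    while len(clusters) > 1:
        # Score each pair of clusters by its most similar pair of members.
        best_score, (i, j) = max(
            ((max(similarity(a, b) for a in clusters[i] for b in clusters[j]), (i, j))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda scored: scored[0],
        )
        if best_score < threshold:
            break  # conservative: stop rather than over-merge
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters


mentions = [
    {"doc": "DOC32", "name": "Barack Obama", "context": {"White House", "Biden", "D.C."}},
    {"doc": "DOC342", "name": "Obama", "context": {"White House", "Biden", "D.C."}},
    {"doc": "DOC7", "name": "Michelle Obama", "context": {"Chicago"}},  # hypothetical
]
clusters = cluster(mentions)
```

With these inputs the two entities that share co-mentions end up in one cluster, while the third stays separate because a partial name match alone does not reach the threshold.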
  23. Kripke (pipeline step 2)
      • Kelvin clustered ~1M document entities into ~300K KB entities
      • Conservative, tending to under- rather than over-merge
      • Typical pattern is one large cluster and a long tail of small ones
      • Barack Obama had one cluster of 8K doc. entities and ~100 small ones
      [Chart of Obama cluster sizes: 72 singleton Obama clusters + 15 with just two; one huge 8K cluster]
  24. Inference and adjudication (pipeline step 3)
      Inference and adjudication rules are used to
      • Delete relations violating KB constraints
        – A person can't be born in an organization
      • Add relations from sound inference rules
        – Two people sharing a parent are siblings
        – Born in place P1, P1 part_of P2 => Born in P2
      • Merge entities using rules
        – Merge cities in same Geo. region with same names
      • Add relations by default or heuristic rules
        – A person is probably a citizen of their country of birth
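Three of the rules on slide 24 can be sketched over a set of `(subject, relation, object)` tuples. This is an illustrative sketch, not Kelvin's rule engine: the relation names `parent_of`, `born_in`, `part_of` and `sibling` are simplified stand-ins for the TAC relations, and the facts below (including the "Acme Corp" error) are invented sample data.

```python
def adjudicate(facts, types):
    """Drop facts violating a type constraint: nobody is born in an organization."""
    return {(s, r, o) for (s, r, o) in facts
            if not (r == "born_in" and types.get(o) == "ORG")}


def infer(facts):
    """Add facts from two sound rules: shared parent and place containment."""
    facts = set(facts)
    # Rule: two people sharing a parent are siblings.
    children = {}
    for s, r, o in facts:
        if r == "parent_of":
            children.setdefault(s, set()).add(o)
    for kids in children.values():
        for a in kids:
            for b in kids:
                if a != b:
                    facts.add((a, "sibling", b))
    # Rule: born in P1 and P1 part_of P2 => born in P2; repeat until nothing new.
    changed = True
    while changed:
        new = {(s, "born_in", p2)
               for (s, r, p1) in facts if r == "born_in"
               for (x, r2, p2) in facts if r2 == "part_of" and x == p1}
        changed = not new <= facts
        facts |= new
    return facts


FACTS = {
    ("Barack Obama", "parent_of", "Malia"),
    ("Barack Obama", "parent_of", "Sasha"),
    ("Barack Obama", "born_in", "Honolulu"),
    ("Honolulu", "part_of", "Hawaii"),
    ("Barack Obama", "born_in", "Acme Corp"),  # hypothetical extraction error
}
TYPES = {"Honolulu": "GPE", "Hawaii": "GPE", "Acme Corp": "ORG"}
kb = infer(adjudicate(FACTS, TYPES))
```

Running adjudication before inference matters here: a bad fact that survives would be propagated by the sound rules just as readily as a good one.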
  25. Entity Linking (pipeline step 3)
      • Simple strategy to link text entities to reference KB
      • TAC uses the last Freebase KB; our subset has
        – ~4.5M entities and ~150M triples
        – Names and text in English, Spanish and Chinese
      • Entity linking compares entity mentions with Freebase entity names & aliases, weighted by
        – How often each mention is used for the entity
        – Entity significance (log of inlinks (1..20))
      • Results: set of candidates and a score for each
        – Reject candidates if too many or scores low
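The scoring described on slide 25 might look roughly like the sketch below. Several details are assumptions rather than facts from the talk: the significance is taken to be the base-2 logarithm of the inlink count clamped to the range 1..20, the mention weight is the share of the alias's uses that point to each entity, and the thresholds, counts and the `m.hypothetical` identifier are invented for illustration.

```python
import math


def significance(inlinks):
    """Assumed reading of 'log of inlinks (1..20)': log2, clamped to [1, 20]."""
    if inlinks < 1:
        return 1.0
    return min(20.0, max(1.0, math.log2(inlinks)))


def link(mention, alias_counts, inlinks, max_candidates=5, min_score=2.0):
    """Return (best entity or None, scored candidates) for a mention string."""
    candidates = alias_counts.get(mention, {})
    total = sum(candidates.values())
    scored = sorted(
        ((count / total * significance(inlinks.get(entity, 0)), entity)
         for entity, count in candidates.items()),
        reverse=True,
    )
    # Reject when there are no candidates, too many, or the best score is low.
    if not scored or len(scored) > max_candidates or scored[0][0] < min_score:
        return None, scored
    return scored[0][1], scored


# Hypothetical alias table: how often "Hopper" refers to each KB entity.
ALIAS_COUNTS = {"Hopper": {"m.02fn5": 60, "m.hypothetical": 40}}
INLINKS = {"m.02fn5": 4096, "m.hypothetical": 1024}
best, scored = link("Hopper", ALIAS_COUNTS, INLINKS)
```

Returning the full candidate list alongside the decision keeps the rejected cases inspectable, which is useful when tuning the two rejection thresholds.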
  26. KB-level merging rules (pipeline step 3)
      • Merge cities with same name in same state
      • Highly discriminative relations give evidence of sameness
        – per:spouse is few to few
        – org:top_level_emp is few to few
      • Merge PERs with similar names who were
        – Both CEOs of the same company, or
        – Both married to the same person
      • Merge entities linked to same Freebase entity
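Two of the merging rules on slide 26 (same Freebase link, same city name in the same state) reduce to grouping entities that share a key. A minimal sketch using a union-find structure follows, so that evidence from several rules can be combined. The function names, entity ids and city data are hypothetical, and the name-similarity rules for PERs are not covered here.

```python
from collections import defaultdict


def pairs_with_same_key(keys):
    """Yield entity pairs that share a non-None key (e.g. a Freebase id)."""
    by_key = defaultdict(list)
    for entity, key in keys.items():
        if key is not None:
            by_key[key].append(entity)
    for members in by_key.values():
        for other in members[1:]:
            yield members[0], other


def merge_entities(entities, same_pairs):
    """Union-find merge: return sorted groups of entities judged to be the same."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for e in entities:
        groups[find(e)].append(e)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical KB fragments: Freebase links and (city name, state) keys.
freebase_links = {":e1": "m.02fn5", ":e2": "m.02fn5"}
city_keys = {":c1": ("Springfield", "Illinois"),
             ":c2": ("Springfield", "Illinois"),
             ":c3": ("Springfield", "Missouri")}
pairs = list(pairs_with_same_key(freebase_links)) + list(pairs_with_same_key(city_keys))
merged = merge_entities([":e1", ":e2", ":c1", ":c2", ":c3"], pairs)
```

The two Illinois cities and the two entities with the same Freebase link are merged, while the Missouri city stays separate.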
  27. Slot Value Consolidation (pipeline step 3)
      • Problem: too many values for some slots, especially for ‘popular’ entities, e.g.
        – An entity with four different per:age values
        – Obama has >100 per:employee_of values
      • Strategy: rank values and select best
        – Rank attested values by # of attesting docs
        – Rank inferred values below attested values
        – Select best value for single-valued slots
        – Select best N values for multi-valued slots
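The ranking strategy on slide 27 can be sketched directly. The function name `consolidate`, the tuple layout and the sample age values are assumptions for illustration, and the default of three values for multi-valued slots is arbitrary.

```python
def consolidate(values, single_valued, n=3):
    """Rank slot values and keep the best one (or the best n).

    values: list of (value, attesting_doc_count, inferred) tuples.
    Attested values rank above inferred ones; within each group,
    values with more attesting documents rank higher.
    """
    ranked = sorted(values, key=lambda v: (v[2], -v[1]))
    limit = 1 if single_valued else n
    return [value for value, _, _ in ranked[:limit]]


# Hypothetical per:age candidates for one entity.
ages = [("73", 2, False), ("72", 5, False), ("71", 0, True), ("27", 1, False)]
best_age = consolidate(ages, single_valued=True)
top_three = consolidate(ages, single_valued=False)
```

Because inferred values sort after all attested ones, an inferred value is selected only when the attested values run out.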
  28. Inference in Cold Start (pipeline step 3)
      • In CS we can only use sound inference rules, e.g.
        – People are siblings if they share a parent
        – People born in city X were born in region Y if X is in Y
      • Default and heuristic rules are not attested
        – You probably resided in a city where you went to school
      • 2014 base run found 848 attested sibling relations; inference added 132, 16% more
      • Drawing sound inferences from {good|bad} facts yields {good|bad} results
  29. Materialize KB versions (pipeline step 4)
      Some work is needed to encode our nascent KB in an actual KB system or serialization
      • TAC requires a custom serialization with specific requirements, e.g., at most four provenance strings of length ≤ 150 characters
        – Pick best for slots with many provenance strings
      • RDF requires an OWL schema and using reification for metadata, e.g., mention and relation provenance
      We use RDF for subsequent use
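The provenance limits mentioned on slide 29 (at most four strings, each at most 150 characters) can be enforced with a small filter. In this sketch the offsets are treated as inclusive, which is consistent with the slide 14 example where "Dennis Hopper" spans 185-197. Ranking by a confidence score is an assumption, since the slide only says to pick the best, and the confidence values below are invented.

```python
MAX_SPANS = 4     # at most four provenance strings per fact
MAX_LENGTH = 150  # each provenance string at most 150 characters


def select_provenance(spans):
    """Pick the best provenance spans that satisfy the serialization limits.

    spans: list of (doc_id, start, end, confidence) with inclusive offsets.
    """
    usable = [s for s in spans if s[2] - s[1] + 1 <= MAX_LENGTH]
    usable.sort(key=lambda s: s[3], reverse=True)
    return [f"{doc}:{start}-{end}" for doc, start, end, _ in usable[:MAX_SPANS]]


# Offsets follow the slide 14 example; confidence values are hypothetical.
spans = [
    ("APW_021", 185, 197, 0.9),
    ("APW_021", 507, 512, 0.8),
    ("APW_021", 618, 623, 0.7),
    ("CMN_011", 930, 936, 0.6),
    ("APW_021", 0, 400, 0.99),   # too long: 401 characters
    ("APW_021", 521, 528, 0.5),  # dropped: only four spans are kept
]
selected = select_provenance(spans)
```

The overlong span is discarded even though it has the highest confidence, because the length limit is a hard requirement of the serialization.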
  30. 2015 TAC Results
      • TAC Cold Start Knowledge Base Population
        – Eight teams participated in 2015 Cold Start KBP
        – We placed 2nd and 3rd depending on the metric
      • TAC Entity Discovery and Linking
        – Six teams submitted runs for all three languages, 8 for ENG, 7 for CMN and 7 for SPA
        – We placed third for all languages, second for Chinese and third for English
  31. Lessons Learned
      • We always have to mind precision & recall
      • Extracting information from text is inherently noisy; reading more text helps both
      • Using machine learning at every level is important
      • Making more use of probabilities will help
      • Extracting information about events is hard
      • Recognizing the temporal extent of relations is important, but still a challenge
  32. Conclusion
      • KBs help in extracting information from text
      • The information extracted can update the KBs
      • The KBs provide support for new tasks, such as question answering and speech interfaces
      • We'll see this approach grow and evolve in the future
      • New machine learning frameworks will result in better accuracy
  33. For more information, contact finin@umbc.edu
