Moving Beyond Entity Extraction to Entity Resolution - Human Language Technology Conference

Entity extraction finds names in documents, providing important raw material for big decisions. But finding all mentions of the name “George Bush” is very different from finding all mentions of the 43rd US President.

Making big decisions from big data is hopeless unless analytics advance from providing snippets of text to providing statements of truth. Such advances present challenges both of accuracy and of usability. We’ll explore these challenges and demonstrate ways of addressing them.

View more slides from the Human Language Technology Conference 2012 here: http://info.basistech.com/hlt-2012-slides

Published in: Technology

  1. Things, not Strings: From Entity Extraction to Entity Resolution. David Murgatroyd, VP, Engineering, Basis Technology. Basis Technology – Human Language Technology Conference 2012
  2. Motivation: Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.
  3. (Image-only slide; no text recovered.)
  4. ✗
  5. ✗
  6. ✗
  7. (Image-only slide; no text recovered.)
  8. ✗ ✗
  9. ✗ ✗ ✓
  10. Help? That was a lot of work. Can text analytics help?
  11. Filter? Filter out pages with the wrong guy? ✗ ✗ ✓
  12. Filter Example
  13. Filter? Add some filters (a/k/a facets)… ✗ ✗ ✓
  14. Filter? Add some filters (a/k/a facets)… ✗ ✗ ✓
  15. Filter? Add some filters (a/k/a facets)… Filter results by… People: <choice 1>, <choice 2>, <choice 3>, … ✗ ✗ ✓
  16. Filter? But what can we use as choices? Filter results by… People: <choice 1>, <choice 2>, <choice 3>, … ✗ ✗ ✓
  17. Entity Extraction (Name Tagging): Find names of people, places, and organizations in a document.
  18. In-document Coreference Resolution: Group names referring to the same person within a document.
  19. Filter choices? But what can we use as choices? Filter results by… People: <choice 1>, <choice 2>, <choice 3>, … ✗ ✗ ✓
  20. Filter choices? Choices: first way that each person was mentioned in each document? Filter results by… Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, … ✗ ✗ ✓
  21. Filter? Choices: first name string for each person in each document? Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, … ✗ ✗ ✓
  22. Filter? Choices: first name string for each person in each document? Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, … ✗ ✓
  23. Filter? Problem: Ambiguity – one name, many entities. Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, … ✗ ✓
  24. Filter? Problem: Variety – one person, many names. Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, … ✗ ✓
  25. Filter? Problem: Variety – one person, many names. Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, … ✗ ✓
  26. Where does your favorite data set fall? [Chart: Variety vs. Ambiguity; # of documents: Thousands, Millions, Billions]
  27. Deal with ambiguity and variety? Magically group names by person across documents. Filter results by… People: <choice 1>, <choice 2>, <choice 3>, … ✗ ✗ ✓
  28. Labels for choices? But there’s still the problem of choices… Filter results by… People: <choice 1>, <choice 2>, <choice 3>, … ✗ ✗ ✓
  29. Labels for choices? Use person’s name from highest ranked doc? Still some ambiguity. Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, … ✗ ✗ ✓
  30. Labels for choices? Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia). Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, … Database entries: J. Christopher Stevens, Chris Stephens, … ✗ ✗ ✓
  31. Labels for choices? For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page). Filter results by… People: Kris Stephens, J. Christopher Stevens, Chris Stephens, … ✗ ✗ ✓
  32. Filter? For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page). Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor) ✗ ✗ ✓
  33. Filter. Let’s give it a try… Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, … ✗ ✗ ✓
  34. Filter. Let’s give it a try… Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … ✗ ✗ ✓
  35. Filter. Let’s give it a try… Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … ✓
  36. Filter. Let’s give it a try… Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … ✓
  37. Filter. Let’s give it a try… Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … ✓ ✓ ✓
  38. Does it work? How do you measure?
  39. How do you measure? Imagine this was the result of applying the filter with the name from Wikipedia. Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …
  40. How do you measure? Precision: for each document, how much of the stuff grouped with it is correct? Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Per document: ✗ 1/3 = 33%, ✓ 2/3 = 67%, ✓ 2/3 = 67%
  41. How do you measure? Recall: for each document, how much of the correct stuff is grouped with it? Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Per document: ✓ 2/5 = 40%, ✓ 2/5 = 40%, ✗ ✗ ✗
  42. Does it work? We often combine Precision and Recall measurements into a single measurement, called “F”.
  43. Where does your favorite data set fall? [Chart: Variety vs. Ambiguity; # of documents: Thousands, Millions, Billions]
  44. Where does your favorite data lie? [Chart: Variety vs. Ambiguity; # of documents: Thousands, Millions, Billions; F>=?, F>=70, F>=85. Corpora: ACE 2005, WEPS-2, TAC pre-2012, TAC eng 2012, TAC zho 2012, TAC spa 2012, Basis Balanced, Basis Ambig, Basis Variance 1, Basis Variance 2]
  45. Trading off Errors: Let’s pretend you’re researching the pastors instead. Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  46. Trading off Errors: What if you think there are too many (or too few)? Add a slider for making the filter more fine (or coarse). Filtered by… People: Kris Stephens (pastor). Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  47. Trading off Errors: Make the filter more fine. Filtered by… People: Kris Stephens (pastor). Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  48. Demo
  49. Questions. Suggested questions: Doesn’t Google already do this? Speed? Scale? Multi-lingual? What other uses are there for entity resolution beyond faceted search?
  50. Thank you! For more information: Visit www.basistech.com. Write to conference@basistech.com. Call 617-386-2090.
  51. Doesn’t Google already do this? Some, when searching for famous entities.
  52. Speed/Scale: Support from BRAVE for scale in CY13! Research version: tested up to 1M docs; sub-second per document; incremental updates (i.e., you see documents published minutes ago).
  53. Doesn’t Google already do this?
  54. Other uses for entity resolution? Supporting relationship resolution by resolving the entities participating in them; knowledge base population; integrating disparate data sets; alerting; improving relevance of search results; predictive analytics.
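
The per-document precision and recall on slides 40 and 41, and the combined “F” on slide 42, can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation. The document IDs, the group labels, and the decision to put each of the three missed documents in its own group are hypothetical; only the counts (3 documents returned, 2 of them correct, 5 correct documents overall) come from the slides. The slides do not say which form of F is used, so the balanced harmonic mean (F1) is assumed here. The per-document definitions resemble what is commonly called the B-cubed clustering metric, though the slides do not name it.

```python
"""Per-document precision, recall, and F for a cross-document grouping.

Document IDs, group labels, and the placement of the three missed
documents are hypothetical; only the counts follow slides 40 and 41.
"""


def per_document_scores(system, gold):
    """Return {doc: (precision, recall)} for every document.

    system maps doc -> group assigned by the filter.
    gold maps doc -> the entity the document is really about.
    """
    scores = {}
    for doc in system:
        # Everything the system grouped with this document (including itself).
        grouped = {d for d in system if system[d] == system[doc]}
        # Everything that truly belongs with this document (including itself).
        correct = {d for d in gold if gold[d] == gold[doc]}
        overlap = grouped & correct
        scores[doc] = (len(overlap) / len(grouped), len(overlap) / len(correct))
    return scores


def f_measure(precision, recall):
    """Balanced harmonic mean of precision and recall (F1, an assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The filter returns d1, d2, d3. Only d2 and d3 are really about the
# ambassador; d1 is about a pastor. d4, d5, d6 are about the ambassador
# but were missed (each placed in its own group here, as an assumption).
SYSTEM = {
    "d1": "stevens_filter", "d2": "stevens_filter", "d3": "stevens_filter",
    "d4": "other_a", "d5": "other_b", "d6": "other_c",
}
GOLD = {
    "d1": "pastor",
    "d2": "ambassador", "d3": "ambassador", "d4": "ambassador",
    "d5": "ambassador", "d6": "ambassador",
}
SCORES = per_document_scores(SYSTEM, GOLD)

if __name__ == "__main__":
    for doc, (p, r) in sorted(SCORES.items()):
        print(f"{doc}: precision={p:.2f} recall={r:.2f} F={f_measure(p, r):.2f}")
```

With these inputs the wrongly included document gets precision 1/3, and the two correct documents get precision 2/3 and recall 2/5, matching the slide figures. Under the F1 assumption those two documents score F = 0.5.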
