From Text to Truth: Real World Facets for Multilingual Search

Presented by Benson Margulies, Executive Vice President and Chief Technology Officer, Basis Technology

Solr's ability to facet search results gives end-users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn't want a "George Bush" facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for "George W. Bush" or even "乔治·沃克·布什" (a Chinese translation) that are each limited to just one string. We'll explore the benefits and challenges of empowering Solr users with real-world facets.

Transcript of "From Text to Truth: Real World Facets for Multilingual Search"

  1. From Text to Truth: Real World Facets for Multilingual Search. Benson Margulies, Executive Vice President and Chief Technical Officer. Lucene/SOLR Revolution 2013.
  2. Motivation: Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.
  3. [no text]
  4. ✗
  5. ✗
  6. ✗
  7. [no text]
  8. ✗ ✗
  9. ✓ ✗ ✗
  10. Help? That was a lot of work. Can text analytics help?
  11. ✓ ✗ ✗ Filter? Filter out pages with the wrong guy?
  12. ✓ ✗ ✗ Filter? Add some filters (a/k/a facets)…
  13. ✓ ✗ ✗ Filter? Add some filters (a/k/a facets)…
  14. ✓ ✗ ✗ Filter? Add some filters (a/k/a facets)… Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …
  15. ✓ ✗ ✗ Filter? But what can we use as choices? Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …
  16. Entity Extraction (Name Tagging): Find names of persons, places, organizations in document.
  17. In-document Coreference Resolution: Group names referring to the same person, within a document.
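
A toy illustration of slides 16 and 17, assuming a name tagger has already produced the person mentions of one document in reading order. The last-token matching rule and the function name `group_mentions` are hypothetical simplifications for illustration, not the method used in the talk.

```python
def group_mentions(mentions):
    """Group person mentions from ONE document into coreference chains.

    Assumption (deliberately naive): mentions that end in the same token,
    such as "J. Christopher Stevens" and "Stevens", refer to one person.
    """
    chains = {}  # last token (lowercased) -> mentions in reading order
    for mention in mentions:
        key = mention.split()[-1].lower()
        chains.setdefault(key, []).append(mention)
    return list(chains.values())


# Hypothetical tagger output for a single document.
mentions = [
    "J. Christopher Stevens",
    "Stevens",
    "Dan Cathy",
    "Ambassador Stevens",
    "Cathy",
]
chains = group_mentions(mentions)
```

The first mention in each chain is what slide 19 proposes as a facet choice; the following slides show why that string alone is not enough.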
  18. ✓ ✗ ✗ Filter choices? But what can we use as choices? Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …
  19. ✓ ✗ ✗ Filter choices? Choices: first way that each person was mentioned in each document? Filter results by… Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …
  20. ✓ ✗ Filter? Choices: first name string for each person in each document? Add filters… Persons named: Dan Cathy, George Little, … Filtered by… Persons named: Chris Stephens ✗
  21. ✓ ✗ Filter? Choices: first name string for each person in each document? Add filters… Persons named: Dan Cathy, George Little, … Filtered by… Persons named: Chris Stephens
  22. ✓ ✗ Filter? Problem: Ambiguity – one name, many entities. Add filters… Persons named: Dan Cathy, George Little, … Filtered by… Persons named: Chris Stephens
  23. ✓ ✗ Filter? Problem: Variety – one person, many names. Add filters… Filtered by… Add filters… Persons named: Dan Cathy, George Little, … Filtered by… Persons named: Chris Stephens
  24. ✓ ✗ Filter? Problem: Variety – one person, many names. Add filters… Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, … Filtered by… Persons named: Chris Stephens
  25. ✓ ✗ ✗ Deal with ambiguity and variety? Magically group names by person across documents. Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …
  26. ✓ ✗ ✗ Labels for choices? But there’s still the problem of choices… Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …
  27. ✓ ✗ ✗ Labels for choices? Use person’s name from highest ranked doc? Still some ambiguity. Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …
  28. ✓ ✗ ✗ Labels for choices? Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia). Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …; Kris Stephens, J. Christopher Stevens, Chris Stephens, …
  29. ✓ ✗ ✗ Labels for choices? For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page). Filter results by… People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …
  30. ✓ ✗ ✗ Filter? For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page). Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)
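
To make the contrast in slides 22 through 30 concrete, here is a minimal sketch that counts facet values two ways: by the raw name string and by resolved entity. Everything in it is assumed for illustration: the document ids, the entity ids (`kb-1`, `local-7`, `local-9`), the claim that one document misspells the ambassador's name, and the `label_for` helper.

```python
from collections import Counter

# Hypothetical resolver output: (doc id, name as written, resolved entity id).
mentions = [
    ("d1", "Chris Stevens", "kb-1"),
    ("d2", "J. Christopher Stevens", "kb-1"),
    ("d3", "Chris Stephens", "local-7"),
    ("d4", "Chris Stephens", "kb-1"),  # assumed misspelling of the ambassador
    ("d5", "Kris Stephens", "local-9"),
]

# Entities linked to the known-entity database take their label from it.
known_labels = {"kb-1": "J. Christopher Stevens"}
# Entities outside the database get an inferred label: name plus descriptor.
inferred = {
    "local-7": ("Chris Stephens", "pastor"),
    "local-9": ("Kris Stephens", "pastor"),
}


def label_for(entity_id):
    """Return the facet label shown to the user for one entity."""
    if entity_id in known_labels:
        return known_labels[entity_id]
    name, descriptor = inferred[entity_id]
    return f"{name} ({descriptor})"


string_facet = Counter(name for _, name, _ in mentions)
entity_facet = Counter(label_for(eid) for _, _, eid in mentions)
```

In this toy data the string facet "Chris Stephens" mixes two people (ambiguity), while the ambassador is spread over three strings (variety). The entity facet collects all three of his documents under one label.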
  31. ✓ ✗ ✗ Filter. Let’s give it a try… Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  32. ✓ ✗ Filter. Let’s give it a try… Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens ✗
  33. ✓ Filter. Let’s give it a try… Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens
  34. ✓ Filter. Let’s give it a try… Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens
  35. ✓ Filter. On a cross-lingual index, real-world entity facets can open results up across languages, unlike search strings. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens ✓ ✓ Language: English, Chinese, Arabic
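
As a sketch of what the filter in slides 31 through 35 might look like as a Solr request: `q`, `fq`, `facet`, `facet.field` and `wt` are standard Solr request parameters, but the field names `person_entity_id` and `language` and the entity id `kb-1` are assumptions about how an index could be laid out, not details from the talk.

```python
from urllib.parse import urlencode


def entity_filter_params(entity_id):
    """Build Solr parameters that filter on a resolved entity id.

    Filtering on the id rather than on a name string means documents in
    any language can match, provided they were resolved to that entity.
    """
    return [
        ("q", "*:*"),                                # match all documents
        ("fq", f'person_entity_id:"{entity_id}"'),   # hypothetical field
        ("facet", "true"),
        ("facet.field", "person_entity_id"),         # remaining people choices
        ("facet.field", "language"),                 # hypothetical field
        ("wt", "json"),
    ]


query_string = urlencode(entity_filter_params("kb-1"))
```

The resulting string can be appended to a collection's `/select` handler URL.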
  36. Trading off Errors: Let’s pretend you’re researching the pastors instead. Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  37. Trading off Errors: What if you think there are too many (or too few)? Add a slider for making filter more fine (or coarse). Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: Kris Stephens (pastor)
  38. Trading off Errors: Make the filter more fine. Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: Kris Stephens (pastor)
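
One way to picture the slider in slides 37 and 38 is as a similarity threshold for merging names into entities. The sketch below is an assumption about how such a control could work, using made-up pairwise scores and a simple single-link merge; the talk does not say how its slider is implemented.

```python
def cluster(names, scores, threshold):
    """Merge names whose pairwise score meets the threshold (single link)."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


names = ["Chris Stephens", "Kris Stephens", "J. Christopher Stevens"]
# Hypothetical similarity scores between name pairs, in [0, 1].
scores = {
    ("Chris Stephens", "Kris Stephens"): 0.9,
    ("Chris Stephens", "J. Christopher Stevens"): 0.6,
    ("Kris Stephens", "J. Christopher Stevens"): 0.5,
}

coarse = cluster(names, scores, 0.55)  # slider toward "coarse"
medium = cluster(names, scores, 0.80)
fine = cluster(names, scores, 0.95)    # slider toward "fine"
```

Raising the threshold splits groups apart (fewer wrong merges, more missed ones); lowering it does the opposite, which is the error trade-off the slides describe.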
  39. Demo
  40. RNI Similarity Matching: “Tamerlan Tsarnaev”. And the problem only gets worse with Multiple Languages.
  41. Fuzzy name search in Solr
      • Facets are one way to navigate names
        o assume that you've found some interesting data with an ordinary query
        o what if you are having trouble getting started?
      • Name-specific comparison search is another
      • More complex algorithm than Levenshtein distance on names
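
For reference, this is the plain Levenshtein (edit) distance that slide 41 says is not enough for names. The implementation is the standard dynamic-programming one; the example names come from the earlier slides.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


near_miss = levenshtein("Chris Stevens", "Chris Stephens")
same_person = levenshtein("Chris Stevens", "J. Christopher Stevens")
```

Here the other spelling sits at distance 2, while the ambassador's fuller name sits at distance 9, so raw edit distance ranks the wrong candidate closer. That is the kind of case a name-specific comparison has to handle.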
  42. Plugging in more complex search
      • Open up the search component pipeline
      • First component preprocesses query
        o Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
      • Second component rescores results
        o detailed comparison of pairs of names to derive final score.
      • Sad limitation (so far): scores not normalized to ordinary Lucene values
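
The two-component design on slide 42 can be sketched outside Solr as a broad candidate match followed by a rescoring pass. This is only an illustration in plain Python: the `VARIANTS` table, the token-overlap score and all function names are hypothetical stand-ins, and none of it is Solr's or the presenter's actual API.

```python
# Hypothetical cross-language and cross-script variants per name token.
VARIANTS = {
    "fred": ["fred", "frederic", "frédéric", "fryderyk"],
    "chopin": ["chopin", "szopen", "шопен"],
}


def preprocess(name):
    """Stage 1: expand each query token into a list of alternatives."""
    return [VARIANTS.get(tok, [tok]) for tok in name.lower().split()]


def is_candidate(expanded, indexed_name):
    """Broad, recall-oriented match: any alternative of any token appears."""
    tokens = set(indexed_name.lower().split())
    return any(alt in tokens for alts in expanded for alt in alts)


def rescore(expanded, indexed_name):
    """Stage 2: fraction of query tokens matched by some alternative."""
    tokens = set(indexed_name.lower().split())
    hits = sum(1 for alts in expanded if any(alt in tokens for alt in alts))
    return hits / len(expanded)


def search(name, index):
    expanded = preprocess(name)
    candidates = [n for n in index if is_candidate(expanded, n)]
    return sorted(((rescore(expanded, n), n) for n in candidates), reverse=True)


index = ["Frédéric Chopin", "Fryderyk Szopen", "Kate Chopin",
         "Fred Astaire", "Franz Liszt"]
results = search("Fred Chopin", index)
```

As on the slide, the rescoring stage produces its own score scale, which is not comparable to ordinary Lucene relevance scores.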
  43. And it does SolrCloud, too ...
      • Preprocessor runs before fan-out to shards
      • rescoring runs out on the shards
      • So the work of checking candidate matches is divided up amongst the shards.
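
A rough simulation of the flow on slide 43, assuming shards are plain lists of names and the coordinator merges the per-shard hits. The token-overlap score and the function names are hypothetical. The shards are visited in a loop here, whereas a real cluster would process them in parallel.

```python
import heapq


def preprocess(name):
    """Runs once on the coordinating node, before fan-out."""
    return set(name.lower().split())


def rescore_on_shard(query_tokens, shard_names):
    """Runs on each shard, against that shard's own candidates only."""
    scored = []
    for name in shard_names:
        overlap = len(query_tokens & set(name.lower().split()))
        if overlap:
            scored.append((overlap / len(query_tokens), name))
    return scored


def distributed_search(name, shards, k=3):
    query_tokens = preprocess(name)
    per_shard = [rescore_on_shard(query_tokens, shard) for shard in shards]
    merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(k, merged)  # coordinator keeps the best k


shards = [
    ["Chris Stevens", "Dan Cathy"],
    ["J. Christopher Stevens", "Chris Stephens"],
    ["George Little"],
]
top = distributed_search("Chris Stevens", shards, k=2)
```

Each shard compares the query only against its own names, so the expensive pairwise comparison is split across shards.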
  44. Questions
      • Suggested questions:
        – Doesn’t Google already do this?
        – Speed? Scale?
        – Multi-lingual?
        – What other uses are there for entity resolution beyond faceted search?
  45. Doesn’t Google already do this? Some, when searching for famous entities.
  46. Speed/Scale
      • Future plans include scaling experiments
      • Research version:
        – tested up to 1m docs
        – Sub-second per document
        – Incremental updates (i.e., you see documents published minutes ago)
  47. Other uses for entity resolution?
      • Supporting relationship resolution by resolving participating entities in them.
      • Knowledge base population
      • Integrating disparate data sets
      • Alerting
      • Improving relevance of search results
      • Predictive Analytics
  48. For more information: Visit www.basistech.com. Write to conference@basistech.com. Call 617-386-2090. Thank you!
  49. CONFERENCE PARTY: The Tipsy Crow, 770 5th Ave. Starts after Stump The Chump. Your conference badge gets you in the door. TOMORROW: Breakfast starts at 7:30. Keynotes start at 8:30. CONTACT: Benson Margulies, benson@basistech.com