Lucene/SOLR Revolution 2013 1
From Text to Truth: Real World Facets for
Multilingual Search
Benson Margulies
Executive Vice President and Chief Technical Officer
Your job is to analyze reciprocal antagonism
between Christian and Islamic extremists across the
globe.
You want to find information on the Internet on
Christian extremist reaction to the killing of the U.S.
Ambassador to Libya.
Motivation
That was a lot of work.
Can text analytics help?
Help?
Filter out pages with the wrong guy?
Filter?
Add some filters (a/k/a facets)…
Filter?
Filter results by…
People
<choice 1>
<choice 2>
<choice 3>
…
But what can we use as choices?
Filter?
Find the names of people, places, and organizations in a document.
Entity Extraction (Name Tagging)
Group names referring to the same person, within a document.
In-document Coreference Resolution
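As a rough illustration of in-document grouping, here is a minimal Python sketch. The token-containment rule, the `TITLES` set, and the names `name_tokens` and `group_mentions` are assumptions made for this example only; a production coreference system would use much richer evidence than shared tokens.

```python
# Honorifics ignored when comparing mentions (illustrative, not exhaustive).
TITLES = {"mr", "mrs", "ms", "dr", "ambassador", "pastor"}

def name_tokens(mention):
    """Lowercase a mention, drop periods and honorifics, return its token set."""
    words = mention.lower().replace(".", "").split()
    return {w for w in words if w not in TITLES}

def group_mentions(mentions):
    """Group person mentions from ONE document into coreference chains.

    Toy rule: a mention joins the first chain whose longest mention
    contains all of its tokens (or is contained by it).
    """
    chains = []
    for mention in mentions:
        tokens = name_tokens(mention)
        for chain in chains:
            longest = name_tokens(max(chain, key=len))
            if tokens and longest and (tokens <= longest or longest <= tokens):
                chain.append(mention)
                break
        else:
            # No existing chain matched, so this mention starts a new one.
            chains.append([mention])
    return chains

doc_mentions = ["J. Christopher Stevens", "Stevens", "Chris Stephens",
                "Mr. Stevens", "Stephens"]
for chain in group_mentions(doc_mentions):
    print(chain)
```

Note that this only works inside a single document. Across documents the same rule would merge unrelated people who happen to share a surname.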
But what can we use as choices?
Filter choices?
Choices: first way that each person was mentioned
in each document?
Filter choices?
Filter results by…
Persons named
Kris Stephens
Chris Stephens
Dan Cathy
George Little
…
Choices: first name string for each person in each
document?
Filter?
Add filters…
Persons named
Dan Cathy
George Little
…
Filtered by…
Persons named
Chris Stephens
Problem: Ambiguity – one name, many entities
Filter?
Problem: Variety – one person, many names
Filter?
Add filters…
Persons named
Dan Cathy
George Little
…
Chris Stevens
J. Christopher Stevens
…
Filtered by…
Persons named
Chris Stephens
Magically group names by person across
documents.
Deal with ambiguity and variety?
But there’s still the problem of choices…
Labels for choices?
Use person’s name from highest ranked doc?
Still some ambiguity.
Labels for choices?
Filter results by…
People
Kris Stephens
Chris Stephens 1
Chris Stephens 2
…
Entity Resolution: group and also link to a
database of known entities (e.g., Wikipedia).
Labels for choices?
Filter results by…
People
Kris Stephens
Chris Stephens 1
Chris Stephens 2
…
Kris Stephens
J. Christopher Stevens
Chris Stephens
…
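To make the linking step concrete, here is a small Python sketch of resolving a name mention against a database of known entities. Everything in it is an assumption for illustration: the `KNOWLEDGE_BASE` entries, the alias and context-word lists, and the overlap-counting rule are invented here and are not the method used in the talk.

```python
# Hypothetical mini knowledge base; every entry here is made up for illustration.
KNOWLEDGE_BASE = [
    {"label": "J. Christopher Stevens",
     "aliases": {"j. christopher stevens", "chris stevens", "chris stephens"},
     "context": {"ambassador", "libya", "benghazi"}},
    {"label": "Chris Stephens (pastor)",
     "aliases": {"chris stephens"},
     "context": {"pastor", "church", "sermon"}},
]

def resolve(mention, context_words, min_overlap=1):
    """Return the label of the best-matching known entity, or None."""
    name = mention.lower()
    words = {w.lower() for w in context_words}
    best_label, best_overlap = None, 0
    for entry in KNOWLEDGE_BASE:
        if name not in entry["aliases"]:
            continue  # variety: only entries listing this name variant qualify
        # Ambiguity: several entries may share the name, so context decides.
        overlap = len(entry["context"] & words)
        if overlap >= min_overlap and overlap > best_overlap:
            best_label, best_overlap = entry["label"], overlap
    return best_label

print(resolve("Chris Stephens", ["killed", "Libya", "ambassador"]))
print(resolve("Chris Stephens", ["pastor", "church"]))
print(resolve("Dan Cathy", ["restaurant"]))
```

The same surface string resolves to different entities depending on the words around it, and a mention with no matching entry comes back as `None`, which is the case that needs an inferred label.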
  
For items not in the database, infer a unique
label (e.g., for a hypothetical Wikipedia page).
Labels for choices?
Filter results by…
People
Kris Stephens (pastor)
J. Christopher Stevens
Chris Stephens (pastor)
Let’s give it a try…
Filter.
Filter results by…
People
Kris Stephens (pastor)
J. Christopher Stevens
Chris Stephens (pastor)
Dan Cathy
George Little
…
Add filters…
People
Kris Stephens (pastor)
Chris Stephens (pastor)
Dan Cathy
George Little
…
Filtered by…
People
J. Christopher Stevens
On a cross-lingual index, real-world entity facets can
open results up across languages, unlike search
strings.
Filter.
Add filters…
People
Kris Stephens (pastor)
Chris Stephens (pastor)
Dan Cathy
George Little
…
Filtered by…
People
J. Christopher Stevens
Language
English
Chinese
Arabic
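A filter on a resolved entity might translate into a Solr request along these lines. This is a sketch under assumptions: `q`, `fq`, `facet`, and `facet.field` are standard Solr request parameters, but the collection name `docs` and the field names `person_entity` and `language` are hypothetical and would depend on the actual schema.

```python
from urllib.parse import urlencode

def build_facet_query(text_query, entity=None, languages=None):
    """Build a Solr select URL that facets on entities and languages.

    'docs', 'person_entity', and 'language' are hypothetical names.
    """
    params = [
        ("q", text_query),
        ("facet", "true"),
        ("facet.field", "person_entity"),  # one facet value per resolved person
        ("facet.field", "language"),
    ]
    if entity is not None:
        # Filtering on the entity, not the name string, matches every
        # spelling and script that was resolved to that person.
        params.append(("fq", f'person_entity:"{entity}"'))
    if languages:
        params.append(("fq", "language:(" + " OR ".join(languages) + ")"))
    return "/solr/docs/select?" + urlencode(params)

print(build_facet_query("extremist reaction",
                        entity="J. Christopher Stevens",
                        languages=["en", "zh", "ar"]))
```

Because the filter value is an entity label and not a search string, documents in Chinese or Arabic that never contain the Latin-script name can still match.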
  
Let’s pretend you’re researching the pastors
instead.
Trading off Errors
What if you think there are too many (or too few)?
Add a slider for making the filter finer (or coarser).
Trading off Errors
Add filters…
People
J. Christopher Stevens
Chris Stephens (pastor)
Dan Cathy
George Little
…
Filtered by…
People
Kris Stephens (pastor)
Make the filter finer.
Trading off Errors
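One way to picture the slider is as a similarity threshold for merging names into one facet choice. The sketch below is an assumption-laden toy: it uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure and a greedy merge rule, neither of which is claimed to be what the demoed system does.

```python
from difflib import SequenceMatcher

def cluster_names(names, threshold):
    """Greedily group names; a higher threshold gives finer (more) groups."""
    clusters = []
    for name in names:
        for cluster in clusters:
            # Compare against the first name of each existing cluster.
            score = SequenceMatcher(None, name.lower(), cluster[0].lower()).ratio()
            if score >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["Kris Stephens", "Chris Stephens", "Dan Cathy"]
print("coarse:", cluster_names(names, threshold=0.80))
print("fine:  ", cluster_names(names, threshold=0.95))
```

At the coarse setting the two pastors collapse into one choice (too few people); at the fine setting they stay separate, at the risk of splitting one person into several choices.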
Demo
RNI Similarity Matching “Tamerlan Tsarnaev”
And the problem only gets worse with multiple languages.
Fuzzy name search in Solr
• Facets are one way to navigate names
o  assume that you've found some interesting data with an ordinary query
o  what if you are having trouble getting started?
• Name-specific comparison search is another
• More complex algorithm than Levenshtein distance on names
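For reference, this is the plain Levenshtein baseline that the last bullet contrasts against, written as a short Python sketch. It is the textbook edit-distance algorithm, not the name-specific comparison the talk describes; the example names are taken from the earlier slides.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

print(levenshtein("Stephens", "Stevens"))
print(levenshtein("Chris Stevens", "J. Christopher Stevens"))
```

A spelling slip such as Stephens/Stevens costs only 2 edits, while two legitimate forms of the same person's name cost 9, so raw edit distance ranks a different surname as closer than a nickname of the right person. It also has nothing to say about names written in different scripts.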
  
Plugging in more complex search
• Open up the 'search component pipeline'
• First component preprocesses query
o  Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
• Second component rescores results
o  detailed comparison of pairs of names to derive final score.
• Sad limitation (so far): scores not normalized to ordinary Lucene values
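The two-stage shape described above can be sketched in plain Python, outside Solr. All of it is illustrative: the `VARIANTS` table stands in for real cross-language name knowledge, token overlap stands in for the expanded Lucene query, and `difflib.SequenceMatcher` stands in for the detailed name comparison. None of these are Solr or Lucene APIs.

```python
from difflib import SequenceMatcher

# Hypothetical variant table standing in for real cross-language name knowledge.
VARIANTS = {
    "fred chopin": ["fred chopin", "frédéric chopin", "fryderyk chopin"],
}

def preprocess(name):
    """Stage 1: expand the user's name into many query variants."""
    key = name.strip().lower()
    return VARIANTS.get(key, [key])

def retrieve(variants, indexed_names):
    """Cheap, recall-oriented match: keep names sharing any token with a variant."""
    wanted = {token for variant in variants for token in variant.split()}
    return [n for n in indexed_names if wanted & set(n.lower().split())]

def rescore(variants, candidates):
    """Stage 2: detailed pairwise comparison; the best variant score wins."""
    scored = []
    for candidate in candidates:
        best = max(SequenceMatcher(None, v, candidate.lower()).ratio()
                   for v in variants)
        scored.append((best, candidate))
    return sorted(scored, reverse=True)

def search(name, indexed_names):
    variants = preprocess(name)
    return rescore(variants, retrieve(variants, indexed_names))

INDEX = ["Frédéric Chopin", "Fryderyk Chopin", "Kate Chopin",
         "Fred Astaire", "George Sand"]
for score, name in search("Fred Chopin", INDEX):
    print(f"{score:.2f}  {name}")
```

The scores here are on their own 0 to 1 scale and bear no relation to ordinary Lucene relevance scores, which mirrors the limitation noted in the last bullet.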
  
And it does SolrCloud, too ...
• Preprocessor runs before fan-out to shards
• Rescoring runs out on the shards
• So the work of checking candidate matches is divided up amongst the shards.
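The division of labor in these bullets can be mimicked with a small scatter-gather sketch in Python. This is not SolrCloud code: the shards are plain lists, threads stand in for shard servers, and `normalize_query`, `search_shard`, and `distributed_search` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def normalize_query(name):
    """Runs once on the coordinating node, before fan-out."""
    return name.strip().lower()

def search_shard(query, shard_names, top_k):
    """Runs on each shard: score and rank only that shard's own names."""
    scored = [(SequenceMatcher(None, query, n.lower()).ratio(), n)
              for n in shard_names]
    return sorted(scored, reverse=True)[:top_k]

def distributed_search(name, shards, top_k=2):
    query = normalize_query(name)  # done once, not per shard
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: search_shard(query, s, top_k), shards))
    # The coordinator only merges the small per-shard top-k lists.
    merged = sorted((hit for part in partials for hit in part), reverse=True)
    return merged[:top_k]

SHARDS = [["Chris Stevens", "Dan Cathy"],
          ["J. Christopher Stevens", "George Little"],
          ["Kris Stephens"]]
print(distributed_search(" Chris Stevens ", SHARDS, top_k=2))
```

The expensive pairwise comparisons happen in `search_shard`, so adding shards spreads that cost, while the merge step stays cheap.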
  
Questions
•  Suggested questions:
– Doesn’t Google already do this?
– Speed? Scale?
– Multi-lingual?
– What other uses are there for entity resolution
beyond faceted search?
Doesn’t Google already do this?
Some, when searching for famous entities.
Speed/Scale
•  Future plans include scaling experiments
•  Research version:
– Tested on up to 1 million docs
– Sub-second per document
– Incremental updates (i.e., you see documents
published minutes ago)
Other uses for entity resolution?
•  Supporting relationship resolution by resolving the
entities that participate in the relationships.
•  Knowledge base population
•  Integrating disparate data sets
•  Alerting
•  Improving relevance of search results
•  Predictive Analytics
For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090
Thank you!
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets
you in the door
TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
CONTACT
Benson Margulies
benson@basistech.com


From text to truth real world facets for multilingual search

  • 1. Lucene/SOLR Revolution 2013 1 From Text to Truth: Real World Facets for Multilingual Search Benson Margulies Executive Vice President and Chief Technical Officer
  • 2. Lucene/SOLR Revolution 2013 2 Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya. Motivation
  • 3.
  • 6.
  • 8.
  • 12.
  • 14. Lucene/SOLR Revolution 2013 14 That was a lot of work. Can text analytics help? Help?
  • 15. Lucene/SOLR Revolution 2013 15 ✓   ✗   ✗   Filter out pages with the wrong guy? Filter?
  • 16. Lucene/SOLR Revolution 2013 16 ✓   ✗   ✗   Add some filters (a/k/a facets)… Filter?
  • 17. Lucene/SOLR Revolution 2013 17 ✓   ✗   ✗   Add some filters (a/k/a facets)… Filter?
  • 18. Lucene/SOLR Revolution 2013 18 ✓   ✗   ✗   Add some filters (a/k/a facets)… Filter? Filter  results  by…   People   <choice  1>   <choice  2>   <choice  3>   …  
  • 19. Lucene/SOLR Revolution 2013 19 ✓   ✗   ✗   But what can we use as choices? Filter? Filter  results  by…   People   <choice  1>   <choice  2>   <choice  3>   …      
  • 20. Lucene/SOLR Revolution 2013 20 Find names of person, places, organizations in document. Entity Extraction (Name Tagging)    
  • 21. Lucene/SOLR Revolution 2013 21 Group names referring to the same person, within a document. In-document Coreference Resolution
  • 22. Lucene/SOLR Revolution 2013 22 ✓   ✗   ✗   But what can we use as choices? Filter choices? Filter  results  by…   People   <choice  1>   <choice  2>   <choice  3>   …  
  • 23. Lucene/SOLR Revolution 2013 23 ✓   ✗   ✗   Choices: first way that each person was mentioned in each document? Filter choices? Filter  results  by…   Persons  named   Kris  Stephens   Chris  Stephens   Dan  Cathy   George  LiBle   …  
  • 24. Lucene/SOLR Revolution 2013 24 ✓   ✗   Choices: first name string for each person in each document? Filter? Add  filters…   Persons  named   Dan  Cathy   George  LiBle   …   Filtered  by…   Persons  named   Chris  Stephens   ✗  
  • 25. Lucene/SOLR Revolution 2013 25 ✓   ✗   Choices: first name string for each person in each document? Filter? Add  filters…   Persons  named   Dan  Cathy   George  LiBle   …   Filtered  by…   Persons  named   Chris  Stephens  
  • 26. Lucene/SOLR Revolution 2013 26 ✓   ✗   Problem: Ambiguity – one name, many entities Filter? Add  filters…   Persons  named   Dan  Cathy   George  LiBle   …   Filtered  by…   Persons  named   Chris  Stephens  
  • 27. Lucene/SOLR Revolution 2013 27 ✓   ✗   Problem: Variety – one person, many names Filter? Add  filters…   Filtered  by…   Add  filters…   Persons  named   Dan  Cathy   George  LiBle   …   Filtered  by…   Persons  named   Chris  Stephens  
  • 28. Lucene/SOLR Revolution 2013 28 ✓   ✗   Problem: Variety – one person, many names Filter? Add  filters…   Persons  named   Dan  Cathy   George  LiBle   …   Chris  Stevens   J.  Christopher        Stevens   …   Filtered  by…   Persons  named   Chris  Stephens  
  • 29. Lucene/SOLR Revolution 2013 29 ✓   ✗   ✗   Magically group names by person across documents. Deal with ambiguity and variety? Filter  results  by…   People   <choice  1>   <choice  2>   <choice  3>   …  
  • 30. Lucene/SOLR Revolution 2013 30 ✓   ✗   ✗   But there’s still the problem of choices… Labels for choices? Filter  results  by…   People   <choice  1>   <choice  2>   <choice  3>   …      
  • 31. Lucene/SOLR Revolution 2013 31 ✓   ✗   ✗   Use person’s name from highest ranked doc? Still some ambiguity. Labels for choices? Filter  results  by…   People   Kris  Stephens   Chris  Stephens  1     Chris  Stephens  2   …      
  • 32. Lucene/SOLR Revolution 2013 32 ✓   ✗   ✗   Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia). Labels for choices? Filter  results  by…   People   Kris  Stephens   Chris  Stephens  1     Chris  Stephens  2   …       Kris  Stephens   J.  Christopher        Stevens     Chris  Stephens     …  
  • 33. ✓ ✗ ✗ For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page). Labels for choices? Filter results by… People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …
  • 34. ✓ ✗ ✗ For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page). Filter? Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)
  • 35. ✓ ✗ ✗ Let’s give it a try… Filter. Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  • 36. ✓ ✗ Let’s give it a try… Filter. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens ✗
  • 37. ✓ Let’s give it a try… Filter. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens
  • 38. ✓ Let’s give it a try… Filter. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens
  • 39. ✓ On a cross-lingual index, real-world entity facets can open results up across languages, unlike search strings. Filter. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: J. Christopher Stevens ✓ ✓ Language: English, Chinese, Arabic
  • 40. Trading off Errors. Let’s pretend you’re researching the pastors instead. Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …
  • 41. Trading off Errors. What if you think there are too many (or too few)? Add a slider for making the filter more fine (or coarse). Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: Kris Stephens (pastor)
  • 42. Trading off Errors. Make the filter more fine. Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, … Filtered by… People: Kris Stephens (pastor)
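One way to picture the fine/coarse slider is as a similarity threshold on name grouping. The sketch below is an assumption about how such a control could work, not the method used in the talk: it links any two names whose similarity meets the threshold and returns the connected groups. `difflib.SequenceMatcher` is a standard-library stand-in for a real name matcher.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity in [0, 1]; not a name-aware matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(names, threshold):
    """Group names into connected components of the 'similar enough' graph."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

names = ["Chris Stevens", "J. Christopher Stevens", "Chris Stephens",
         "Kris Stephens", "Dan Cathy"]
coarse = cluster(names, 0.0)   # slider all the way to "coarse"
fine = cluster(names, 1.01)    # slider past any possible similarity
```

Raising the threshold can only remove links, so the number of groups never decreases as the slider moves toward "fine".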
  • 43. Demo
  • 44. RNI Similarity Matching: “Tamerlan Tsarnaev”. And the problem only gets worse with multiple languages.
  • 45. Fuzzy name search in Solr
      • Facets are one way to navigate names
          o assume that you've found some interesting data with an ordinary query
          o what if you are having trouble getting started?
      • Name-specific comparison search is another
      • More complex algorithm than Levenshtein distance on names
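For reference, plain Levenshtein distance can be sketched as below. This is the textbook dynamic-programming algorithm, shown only as the baseline the slide says is not enough; it is not the name-comparison algorithm from the talk.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

different_people = levenshtein("Chris Stephens", "Chris Stevens")
same_person = levenshtein("J. Christopher Stevens", "Chris Stevens")
```

On the names from the earlier slides, the two different people come out 2 edits apart, while the two names for the same ambassador are 9 edits apart. Edit distance alone ranks the wrong pair as closer.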
  • 46. Plugging in more complex search
      • Open up the 'search component pipeline'
      • First component preprocesses query
          o Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
      • Second component rescores results
          o detailed comparison of pairs of names to derive final score.
      • Sad limitation (so far): scores not normalized to ordinary Lucene values
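The two-stage shape described above can be sketched in plain Python, outside Solr. Everything here is hypothetical: the variant table, the function names, and the use of `difflib.SequenceMatcher` as the "detailed comparison" are stand-ins, and no Solr or Lucene API is shown.

```python
from difflib import SequenceMatcher

# Hypothetical variant table standing in for cross-language expansion.
VARIANTS = {"fred": {"fred", "frederic", "frédéric", "fryderyk"}}

def preprocess(query):
    """Stage 1: expand each query token into a set of acceptable tokens."""
    return [VARIANTS.get(tok, {tok}) for tok in query.lower().split()]

def candidates(expanded, index):
    """Cheap recall step: keep names sharing any expanded token."""
    wanted = set().union(*expanded)
    return [name for name in index if wanted & set(name.lower().split())]

def rescore(query, names):
    """Stage 2: pairwise comparison of the query against each candidate."""
    scored = [(SequenceMatcher(None, query.lower(), n.lower()).ratio(), n)
              for n in names]
    return sorted(scored, reverse=True)

index = ["Frédéric Chopin", "Fryderyk Chopin", "Fred Astaire",
         "Kate Chopin", "Dan Cathy"]
query = "Fred Chopin"
hits = candidates(preprocess(query), index)
ranked = rescore(query, hits)
```

The first stage is deliberately loose so that the expensive pairwise comparison only runs on a short candidate list.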
  • 47. And it does SolrCloud, too ...
      • Preprocessor runs before fan-out to shards
      • Rescoring runs out on the shards
      • So the work of checking candidate matches is divided up amongst the shards.
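The division of labor on this slide can be illustrated with a toy fan-out, assuming each shard is just a list of names. The thread pool, function names, and scoring are illustrative assumptions and do not reflect how SolrCloud distributes requests internally.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def preprocess(query):
    """Runs once on the coordinating node, before fan-out."""
    return query.lower().strip()

def rescore_on_shard(prepared, shard_names):
    """Runs on each shard: score only that shard's local candidates."""
    return [(SequenceMatcher(None, prepared, n.lower()).ratio(), n)
            for n in shard_names]

def search(query, shards, top_k=3):
    prepared = preprocess(query)
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: rescore_on_shard(prepared, s), shards)
        merged = [hit for part in partials for hit in part]
    # The coordinator only merges already-scored hits.
    return sorted(merged, reverse=True)[:top_k]

shards = [["Chris Stevens", "Dan Cathy"],
          ["J. Christopher Stevens", "Kris Stephens"]]
top = search("chris stevens", shards, top_k=2)
```

The query is prepared once, each shard scores its own candidates in parallel, and the coordinator sorts the combined results.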
  • 48. Questions
      • Suggested questions:
          – Doesn’t Google already do this?
          – Speed? Scale?
          – Multi-lingual?
          – What other uses are there for entity resolution beyond faceted search?
  • 49. Doesn’t Google already do this? Some, when searching for famous entities.
  • 50. Speed/Scale
      • Future plans include scaling experiments
      • Research version:
          – tested up to 1m docs
          – Sub-second per document
          – Incremental updates (i.e., you see documents published minutes ago)
  • 51. Other uses for entity resolution?
      • Supporting relationship resolution by resolving the entities that participate in them
      • Knowledge base population
      • Integrating disparate data sets
      • Alerting
      • Improving relevance of search results
      • Predictive analytics
  • 52. For more information: Visit www.basistech.com. Write to conference@basistech.com. Call 617-386-2090. Thank you!
  • 53. CONFERENCE PARTY: The Tipsy Crow, 770 5th Ave. Starts after Stump The Chump. Your conference badge gets you in the door. TOMORROW: Breakfast starts at 7:30. Keynotes start at 8:30. CONTACT: Benson Margulies, benson@basistech.com