Fuzzy Name Matching
with Elasticsearch
November 16th 2015
Chris Mack @cgmack
mack@basistech.com
Why Match Names?
1. Security
2. Fraud
3. Commerce
Quick survey: How many of you...
• Regularly develop Elastic applications?
• Develop Elastic applications that include names of…
...People?
...Places?
...Products?
...Organizations?
• Have names in languages besides English?
What Makes Name Matching Hard?
Name Variety
Name Ambiguity
How Would You Solve It?
Best Practice: field per variation type?
http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names
• Create a multi_field type with a
field per possible variation
• Complex query against each field
• Generally gives high recall
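The field-per-variation approach can be sketched as a mapping plus a query builder. This is a minimal illustration, not the configuration from the linked answer: the sub-field names and the three custom analyzer names are assumptions that would have to be defined in the index settings, and the sketch uses the `fields` multi-field syntax of more recent Elasticsearch releases rather than the older `multi_field` type.

```python
# Hypothetical layout: one sub-field per kind of name variation.
# Sub-field names and the custom analyzer names are illustrative assumptions.
VARIATION_ANALYZERS = {
    "exact": "keyword",            # built-in analyzer
    "phonetic": "name_phonetic",   # assumed custom analyzer
    "ngram": "name_ngram",         # assumed custom analyzer
    "nickname": "name_synonyms",   # assumed custom analyzer
}


def name_mapping(field="name"):
    """Build a mapping with one sub-field per variation type."""
    return {
        "properties": {
            field: {
                "type": "text",
                "fields": {
                    sub: {"type": "text", "analyzer": analyzer}
                    for sub, analyzer in VARIATION_ANALYZERS.items()
                },
            }
        }
    }


def name_query(text, field="name"):
    """Build a bool query with one match clause per (sub-)field."""
    clauses = [{"match": {field: text}}]
    clauses += [{"match": {f"{field}.{sub}": text}} for sub in VARIATION_ANALYZERS]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}
```

Because any single clause is enough to match, recall is high, but the query grows with every variation type that is added.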
Can’t a name field type do this?
• Manage all the subfields
• Contribute score that reflects phenomena
• Be part of queries using many field types
• Have multiple fields per document
• Have multiple values per field (coming soon)
But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobezDias, Chuy”
1) Reordered.
2) Nickname for first name.
3) Missing 2nd Name.
4) Two spelling differences.
5) Missing space.
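To see why co-occurring variations defeat simple matching, the sketch below applies a naive baseline (lowercase, split on non-letters, compare token sets) to the two names above. The `tokens` helper is an illustrative assumption, not part of any plugin.

```python
import re


def tokens(name):
    """Lowercase a name and split it on anything that is not a letter."""
    return set(re.findall(r"[a-z]+", name.lower()))


a = tokens("Jesus Alfonso Lopez Diaz")
b = tokens("LobezDias, Chuy")
shared = a & b

print(sorted(a))  # ['alfonso', 'diaz', 'jesus', 'lopez']
print(sorted(b))  # ['chuy', 'lobezdias']
print(shared)     # set()
```

The two names share no token at all, so a strategy that handles each variation in isolation has nothing to anchor on.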
Can We Do Better?
• Incorporate our proprietary name matching
• Provide similarity scores to name pairs
• Uses Elasticsearch's Rescore query
• Allows for higher precision ranking and thresholding
• Provides multi-lingual name search
Demo
How does it work?
• Plugin contains custom mapper which does all the
work behind the scenes
What happens at index time?
• NameMapper indexes keys for different phenomena
in separate (sub) fields
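As a rough illustration of "keys for different phenomena", the sketch below derives a few keys from one name, each of which would be stored in its own sub-field. The key types (sorted tokens, initials, consonant skeleton) are assumptions chosen for the example; the slides do not say which keys NameMapper actually generates.

```python
import re


def name_keys(name):
    """Return illustrative index keys, one entry per hypothetical sub-field."""
    toks = re.findall(r"[a-z]+", name.lower())
    return {
        "tokens": toks,
        # Order-insensitive key: tolerates reordered name parts.
        "sorted": " ".join(sorted(toks)),
        # Initials: tolerate abbreviated name parts.
        "initials": [t[0] for t in toks],
        # Consonant skeleton: a crude stand-in for a phonetic key.
        "skeleton": [t[0] + re.sub(r"[aeiou]", "", t[1:]) for t in toks],
    }


keys = name_keys("Lopez Diaz, Jesus")
print(keys["sorted"])    # diaz jesus lopez
print(keys["skeleton"])  # ['lpz', 'dz', 'jss']
```

At query time the same function would be applied to the query name, so that the candidate query compares like with like.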
Plug-in Implementation
What happens at query time?
• Step #1: NameMapper generates analogous keys for
a custom Lucene query that finds good candidates for
re-scoring
What else happens at query time?
• Step #2: Uses Rescore query to score names in the
best candidate documents and reorder accordingly
- Tuned for high precision name matching
- Computationally expensive
What does that function do?
• The 'name_score' function matches the query name
against the indexed name in every candidate
document and returns the similarity score
Plug-in Implementation
Rescore Params: Tradeoff Accuracy vs. Speed
• window_size
- Controls how many of the top
documents to rescore
- Tradeoff accuracy vs speed
• minScoreToCheck - (Added by Us)
- Score threshold top doc must meet
to be rescored
- Tradeoff accuracy vs speed
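The tradeoff behind these two parameters can be simulated in plain Python. This is a sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the expensive name scorer (it is not the plugin's algorithm), `min_score_to_check` is read as a per-hit threshold on the first-pass score, and the weights follow the usual weighted-sum combination of first-pass and rescore scores.

```python
from difflib import SequenceMatcher


def expensive_score(query, name):
    """Stand-in for a costly name scorer; not the plugin's algorithm."""
    return SequenceMatcher(None, query.lower(), name.lower()).ratio()


def rescore(query, hits, window_size, min_score_to_check=0.0,
            query_weight=0.0, rescore_query_weight=1.0):
    """Reorder the top `window_size` hits using the expensive scorer.

    `hits` holds (name, first_pass_score) pairs, best first.
    Returns (names_in_new_order, expensive_calls).
    """
    window, rest = hits[:window_size], hits[window_size:]
    calls = 0
    scored = []
    for name, first_pass in window:
        final = query_weight * first_pass
        # Assumed reading of minScoreToCheck: skip the costly comparison
        # for hits whose first-pass score is below the threshold.
        if first_pass >= min_score_to_check:
            final += rescore_query_weight * expensive_score(query, name)
            calls += 1
        scored.append((name, final))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Hits outside the window are never rescored and keep their order.
    return [name for name, _ in scored] + [name for name, _ in rest], calls


hits = [("Jon Smythe", 3.0), ("John Smith", 2.5),
        ("Joan Smithers", 2.0), ("Jane Doe", 0.5)]
names, calls = rescore("John Smith", hits, window_size=2)
print(names[0], calls)  # John Smith 2
```

A larger window or a lower threshold means more calls to the expensive scorer and a better chance of promoting the right hit; a smaller window or a higher threshold saves time at the risk of never examining it.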
Rescore Params - Integration w/Query
• rescore_query
- Calls the name_score function to get score
- Combine rescore_queries to query across multiple
fields
• query_weight
- Controls how much weight is given to main query
- Allows for queries on other non-name fields
• rescore_query_weight
- Controls how much weight is given to rescore query
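Put together, a search request using these parameters might look like the following sketch. The outer `rescore` / `window_size` / `rescore_query` / `query_weight` / `rescore_query_weight` layout is Elasticsearch's standard rescore syntax; the body of `name_score` (its `field` and `query_name` keys), the wrapping `function_score`, and the first-pass `match` query are assumptions for illustration, since the slides do not show the function's parameters.

```python
def name_search_body(name, field="name", window_size=50,
                     query_weight=0.0, rescore_query_weight=1.0):
    """Build a search body that rescores the top hits with name_score."""
    return {
        # First pass: a cheap query that collects candidates.
        "query": {"match": {field: name}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # Assumed parameter names for the plugin's function.
                        "name_score": {"field": field, "query_name": name}
                    }
                },
                "query_weight": query_weight,
                "rescore_query_weight": rescore_query_weight,
            },
        },
    }


body = name_search_body("Jesus Alfonso Lopez Diaz", window_size=200)
```

With `query_weight` at 0.0 the final ranking inside the window depends on the name score alone; raising it lets clauses on other, non-name fields contribute to the ordering.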
Summary: How it works
• Central Problem
- Name Variety
- Name Ambiguity
• Custom field type
- Splits a single field into multiple fields covering different phenomena
- Supports multiple name fields in a document as well as multivalued fields
- Intercepts the query to inject a custom Lucene query
• Custom rescore function
- Rescores documents with algorithm specific to name matching
- Limits intense calculations to only top candidates
- Highly configurable
Resources
• Code
- https://github.com/cgmack/elastic_meetups
• This Presentation
-
Suggested Questions:
• What if names are in unstructured text?
• What if the names are in other text fields?
• How did you implement multi-valued fields?
• How does it scale?
• How do you handle names not in English?
• How does this relate to the theme of Entity-Centric
Search?
• How do the plug-in's scores relate to Elastic scores?
• How can I learn more?
