Fuzzy Name Matching with Rosette

Normalization is crucial to high-quality search results: who wants irrelevant variations between queries and documents leading to missed hits (e.g., “celebrity” vs. “celebrities”)? Normalizing dictionary words works, but what if your application focuses on names? Whether you’re tackling log analysis, e-commerce, watch list screening, or other applications, names are often the key. Can you find “Abdul Jabbar, Karim” if you search for “Kareem AbdalJabar” or “كريم عبد الجبار”? Applications using Elasticsearch provide some fuzziness by mixing its built-in edit-distance matching and phonetic analysis with more generic analyzers and filters. We’ve tried to go beyond that to provide both better matching and a simpler integration. We use a custom Mapper and Score Function so that linguistic nuances can be handled behind the scenes. We’ll talk about how we built this sort of plug-in for Rosette, how it can be customized, and how it connects to the broader trend of entity-centric search.

Fuzzy Name Matching
with Elasticsearch
November 16th 2015
Chris Mack @cgmack
mack@basistech.com

Why Match Names?
1. Security
2. Fraud
3. Commerce

Quick survey: How many of you...
• Regularly develop Elastic applications?
• Develop Elastic applications that include names of…
...People?
...Places?
...Products?
...Organizations?
• Have names in languages besides English?

What Makes Name Matching Hard?
• Name Variety
• Name Ambiguity

How Would You Solve It?

Best Practice: field per variation type?
http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names
• Create a multi_field type with a
field per possible variation
• Complex query against each field
• Generally gives high recall
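
A minimal sketch of the field-per-variation approach, written as Python dicts that could be sent as index settings and a field mapping. The field name `full_name`, the subfield names, and the analyzer names are hypothetical, and the `phonetic` token filter assumes the Elasticsearch phonetic analysis plugin is installed. Recent Elasticsearch versions express multi-fields through the `fields` parameter rather than a separate `multi_field` type.

```python
# Hypothetical analyzers: one per kind of name variation we want to tolerate.
settings = {
    "analysis": {
        "filter": {
            # Needs the phonetic analysis plugin; the encoder choice is illustrative.
            "name_phonetic": {"type": "phonetic", "encoder": "double_metaphone"},
        },
        "analyzer": {
            "name_exact": {"type": "custom", "tokenizer": "standard",
                           "filter": ["lowercase"]},
            "name_folded": {"type": "custom", "tokenizer": "standard",
                            "filter": ["lowercase", "asciifolding"]},
            "name_sounds": {"type": "custom", "tokenizer": "standard",
                            "filter": ["lowercase", "name_phonetic"]},
        },
    }
}

# One logical name field with a subfield per variation type.
full_name_field = {
    "type": "string",  # "text" in Elasticsearch 5.0 and later
    "analyzer": "name_exact",
    "fields": {
        "folded": {"type": "string", "analyzer": "name_folded"},
        "phonetic": {"type": "string", "analyzer": "name_sounds"},
    },
}


def variation_fields(base, field):
    """List the base field plus every subfield path, e.g. 'full_name.folded'."""
    return [base] + [f"{base}.{sub}" for sub in field.get("fields", {})]


def name_query(name, fields):
    """Build one query that searches every variation field at once."""
    return {"query": {"multi_match": {"query": name, "fields": fields,
                                      "type": "most_fields"}}}


query = name_query("Kareem AbdalJabar",
                   variation_fields("full_name", full_name_field))
```

Every new kind of variation adds another analyzer, another subfield, and another clause to the query, which is the maintenance burden the following slides try to remove.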

Can’t a name field type do this?
• Manage all the subfields
• Contribute score that reflects phenomena
• Be part of queries using many field types
• Have multiple fields per document
• Have multiple values per field (coming soon)

But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobezDias, Chuy”
1) Reordered.
2) Nickname for first name.
3) Missing 2nd Name.
4) Two spelling differences.
5) Missing space.
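
To make the problem concrete, the sketch below computes a plain Levenshtein edit distance between the two names on this slide. It is an illustration only, not the plug-in's algorithm. Elasticsearch's built-in fuzzy matching allows at most two edits per term, so a pair of names that differs in several ways at once falls well outside what a single fuzzy clause can absorb.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


indexed = "jesus alfonso lopez diaz"
query = "lobezdias, chuy"

# The surname alone is within reach of fuzzy matching once the space is ignored.
surname_distance = levenshtein("lopezdiaz", "lobezdias")

# The whole strings are not: reordering, the nickname, and the missing
# middle name all add to the distance.
whole_distance = levenshtein(indexed, query)
```

Here `surname_distance` is 2 (p/b and z/s), while `whole_distance` is far larger than the two-edit limit.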

Can We Do Better?
• Incorporate our proprietary name matching
• Provide similarity scores for name pairs
• Use Elasticsearch's Rescore query
• Allow higher-precision ranking and thresholding
• Provide multilingual name search

How does it work?
• The plug-in contains a custom mapper that does all the
work behind the scenes

What happens at index time?
• NameMapper indexes keys for different phenomena
in separate (sub) fields
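
The slides do not say which keys the NameMapper actually produces, so the following is only a sketch of the general idea: derive several cheap keys from one name, each targeting a different phenomenon, and store each set in its own subfield. The key types (`tokens`, `initials`, `sounds`) and the crude phonetic function are hypothetical stand-ins.

```python
import unicodedata


def fold(text):
    """Lowercase and strip diacritics so 'José' and 'jose' agree."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def crude_sound_key(token):
    """Rough phonetic key: first letter plus the remaining consonants,
    with adjacent repeats collapsed. A stand-in for a real phonetic encoder."""
    key = token[:1]
    for c in token[1:]:
        if c.isalpha() and c not in "aeiouyhw" and key[-1] != c:
            key += c
    return key


def name_keys(name):
    """Return one list of keys per phenomenon, ready to index as subfields."""
    tokens = fold(name).replace(",", " ").split()
    return {
        "tokens": sorted(tokens),                               # reordering
        "initials": sorted({t[0] for t in tokens}),             # abbreviations
        "sounds": sorted({crude_sound_key(t) for t in tokens}), # spelling
    }


keys = name_keys("Abdul Jabbar, Karim")
```

With these stand-in keys, "Karim" and "Kareem" share the sound key `krm`, and "Jabbar" and "Jabar" share `jbr`, so either spelling can retrieve the same document.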

What happens at query time?
• Step #1: NameMapper generates analogous keys for
a custom Lucene query that finds good candidates
for re-scoring

What else happens at query time?
• Step #2: Uses Rescore query to score names in the
best candidate documents and reorder accordingly
- Tuned for high precision name matching
- Computationally expensive

What does that function do?
• The 'name_score' function matches the query name
against the indexed name in every candidate
document and returns the similarity score
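
The two query-time steps can be sketched in plain Python as a cheap candidate pass followed by an expensive scoring pass over only the best candidates. Everything here is a stand-in: `cheap_keys` is a hypothetical recall-oriented key function, and this `name_score` uses `difflib` string similarity in place of the proprietary matcher.

```python
from difflib import SequenceMatcher


def cheap_keys(name):
    """First-pass keys: the first three letters of each lowercased token."""
    return {t[:3] for t in name.lower().replace(",", " ").split()}


def name_score(query_name, indexed_name):
    """Stand-in for the expensive scorer: order-insensitive similarity in [0, 1]."""
    def normalize(n):
        return " ".join(sorted(n.lower().replace(",", " ").split()))
    return SequenceMatcher(None, normalize(query_name),
                           normalize(indexed_name)).ratio()


def search(query_name, documents, window_size=10):
    """Return (score, name) pairs for the rescored candidates, best first."""
    query_keys = cheap_keys(query_name)
    # Step 1: keep documents sharing at least one key, ranked by key overlap.
    candidates = [(len(query_keys & cheap_keys(d)), d) for d in documents]
    candidates = [c for c in candidates if c[0] > 0]
    candidates.sort(key=lambda c: c[0], reverse=True)
    # Step 2: run the expensive scorer on the top window only, then reorder.
    rescored = [(name_score(query_name, d), d)
                for _, d in candidates[:window_size]]
    rescored.sort(key=lambda r: r[0], reverse=True)
    return rescored


results = search("Abdul Jabbar, Karim",
                 ["Karen Abbott", "Karim Abdul Jabbar", "John Smith"])
```

"John Smith" never reaches the scorer because it shares no key with the query, which is how the expensive comparison stays limited to plausible candidates.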

Rescore Params: Tradeoff Accuracy vs. Speed
• window_size
- Controls how many of the top
documents to rescore
- Tradeoff accuracy vs speed
• minScoreToCheck - (Added by Us)
- Score threshold top doc must meet
to be rescored
- Tradeoff accuracy vs speed
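
A small sketch of how the two parameters could interact, assuming (the slides do not spell this out) that `minScoreToCheck` is compared against each document's first-pass score inside the window. The function and argument names are hypothetical.

```python
def select_for_rescore(hits, window_size, min_score_to_check):
    """Pick the hits worth the expensive rescoring pass.

    hits: list of (doc_id, first_pass_score), sorted by score descending.
    """
    window = hits[:window_size]  # only the top documents are considered
    return [(doc_id, score) for doc_id, score in window
            if score >= min_score_to_check]  # and only if they score well enough


hits = [("a", 9.0), ("b", 7.5), ("c", 3.0), ("d", 2.5), ("e", 0.4)]

wide = select_for_rescore(hits, window_size=5, min_score_to_check=0.0)
narrow = select_for_rescore(hits, window_size=3, min_score_to_check=5.0)
```

The wide setting rescores all five hits and so cannot miss a good match ranked low by the first pass; the narrow setting rescores only two and is correspondingly faster.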

Rescore Params - Integration w/Query
• rescore_query
- Calls the name_score function to get score
- Combine rescore_queries to query across multiple
fields
• query_weight
- Controls how much weight is given to main query
- Allows for queries on other non-name fields
• rescore_query_weight
- Controls how much weight is given to rescore query
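
The sketch below assembles a search body using these parameters. The `rescore`, `window_size`, `rescore_query`, `query_weight`, and `rescore_query_weight` keys follow Elasticsearch's standard rescore API. The arguments passed to `name_score` (`field`, `query_name`) and the field name `primary_name` are assumptions, since the slides name the function but not its arguments. `combined_score` shows how Elasticsearch's default `total` score mode blends the two scores.

```python
def rescore_body(main_query, name_field, query_name, window_size=200,
                 query_weight=0.0, rescore_query_weight=1.0):
    """Build a search body whose top hits are rescored by name similarity."""
    return {
        "query": main_query,
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # 'name_score' comes from the plug-in; the argument
                        # names used here are assumed for illustration.
                        "name_score": {"field": name_field,
                                       "query_name": query_name},
                    }
                },
                "query_weight": query_weight,
                "rescore_query_weight": rescore_query_weight,
            },
        },
    }


def combined_score(original, rescore, query_weight=1.0,
                   rescore_query_weight=1.0):
    """Final score under the default 'total' score mode: a weighted sum."""
    return query_weight * original + rescore_query_weight * rescore


body = rescore_body({"match": {"primary_name": "Kareem AbdalJabar"}},
                    "primary_name", "Kareem AbdalJabar")
```

Setting `query_weight` to 0 ranks the window purely by name similarity. A nonzero value lets matches on other, non-name fields in the main query keep some influence on the final order.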

Summary: How it works
• Central Problem
- Name Variety
- Name Ambiguity
• Custom field type
- Splits a single field into multiple fields covering different phenomena
- Supports multiple name fields in a document as well as multivalued fields
- Intercepts the query to inject a custom Lucene query
• Custom rescore function
- Rescores documents with algorithm specific to name matching
- Limits intense calculations to only top candidates
- Highly configurable

Resources
• Code
- https://github.com/cgmack/elastic_meetups
• This Presentation
-

Suggested Questions:
• What if names are in unstructured text?
• What if the names are in other text fields?
• How did you implement multi-valued fields?
• How does it scale?
• How do you handle names not in English?
• How does this relate to the theme of Entity-Centric
Search?
• How do the plug-in’s scores relate to Elastic scores?
• How can I learn more?
