Fuzzy Name Matching with Rosette

Normalization is crucial to high-quality search results: who wants irrelevant variations between queries and documents leading to missed hits (e.g., “celebrity” vs. “celebrities”)? Normalizing dictionary words works, but what if your application focuses on names? Whether you’re tackling log analysis, e-commerce, watch list screening, or other applications, names are often the key. Can you find “Abdul Jabbar, Karim” if you search for “Kareem AbdalJabar” or “كريم عبد الجبار”? Applications using Elasticsearch provide some fuzziness by mixing its built-in edit-distance matching and phonetic analysis with more generic analyzers and filters. We’ve tried to go beyond that to provide both better matching and a simpler integration. We use a custom Mapper and Score Function so that linguistic nuances can be handled behind the scenes. We’ll talk about how we built this sort of plug-in for Rosette, its customization, and its connection to the broader trend of entity-centric search.

Fuzzy Name Matching
with Elasticsearch
November 16th 2015
Chris Mack @cgmack
mack@basistech.com

Why Match Names?
1. Security
2. Fraud
3. Commerce

Quick survey: How many of you...
• Regularly develop Elastic applications?
• Develop Elastic applications that include names of…
...People?
...Places?
...Products?
...Organizations?
• Have names in languages besides English?

Best Practice: field per variation type?
http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names
• Create a multi_field type with a
field per possible variation
• Complex query against each field
• Generally gives high recall
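
A minimal sketch of the field-per-variation approach, written as Python dictionaries that could be sent as Elasticsearch request bodies. The field names, analyzer names, and boost value are illustrative assumptions, not taken from the talk. The sketch uses the `fields` mapping parameter (the successor to the older `multi_field` type), assumes a recent Elasticsearch version with the `text` type, and assumes the analysis-phonetic plugin is installed for the `phonetic` token filter.

```python
import json

# Hypothetical index definition: one sub-field per kind of name variation.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Requires the analysis-phonetic plugin.
                "name_phonetic": {"type": "phonetic", "encoder": "double_metaphone"},
            },
            "analyzer": {
                "name_exact": {"tokenizer": "standard", "filter": ["lowercase"]},
                "name_phonetic": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_phonetic"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "name_exact",
                "fields": {
                    "phonetic": {"type": "text", "analyzer": "name_phonetic"},
                },
            }
        }
    },
}


def build_name_query(name):
    """Query every variation field; any one matching clause is enough (high recall)."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": {"query": name, "boost": 3.0}}},
                    {"match": {"name": {"query": name, "fuzziness": "AUTO"}}},
                    {"match": {"name.phonetic": {"query": name}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_name_query("Kareem AbdalJabar"), indent=2))
```

Every new kind of variation adds another sub-field and another query clause, which is where the query complexity noted on the slide comes from.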

Can’t a name field type do this?
• Manage all the subfields
• Contribute score that reflects phenomena
• Be part of queries using many field types
• Have multiple fields per document
• Have multiple values per field (coming soon)

But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobezDias, Chuy”
1) Reordered.
2) Nickname for first name.
3) Missing 2nd Name.
4) Two spelling differences.
5) Missing space.
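
To illustrate why co-occurring variations defeat a single fuzzy field, the sketch below measures plain Levenshtein distance between the tokens of the two names. Elasticsearch's `fuzziness` setting allows at most two edits per term; for this pair, no query token comes within two edits of any indexed token. The tokenizer and helper names are assumptions made for the example.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tokens(name):
    """Lowercase and split on anything that is not a Latin letter."""
    return re.findall(r"[a-z]+", name.lower())


def closest_token_distances(query_name, indexed_name):
    """For each query token, the edit distance to its nearest indexed token."""
    indexed = tokens(indexed_name)
    return {q: min(levenshtein(q, t) for t in indexed) for q in tokens(query_name)}


distances = closest_token_distances("LobezDias, Chuy", "Jesus Alfonso Lopez Diaz")
print(distances)
```

The missing space fuses two tokens into one, and the nickname shares almost no letters with the given name, so per-term fuzziness alone cannot bridge either gap.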

Can We Do Better?
• Incorporate our proprietary name matching
• Provide similarity scores for name pairs
• Use Elasticsearch's Rescore query
• Allow higher-precision ranking and thresholding
• Provide multilingual name search

How does it work?
• Plugin contains custom mapper which does all the
work behind the scenes

What happens at index time?
• NameMapper indexes keys for different phenomena
in separate (sub) fields
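
As a rough illustration of indexing keys for different phenomena in separate sub-fields, the sketch below derives a few toy keys from a name. The key types, the sub-field names, and the consonant-skeleton heuristic are all hypothetical stand-ins; the talk does not describe the keys NameMapper actually generates.

```python
import re

VOWELS = set("aeiou")


def tokens(name):
    """Lowercase and split on anything that is not a Latin letter."""
    return re.findall(r"[a-z]+", name.lower())


def skeleton(token):
    """Crude stand-in for a phonetic key: first letter, then consonants, no doubled letters."""
    out = [token[0]]
    for ch in token[1:]:
        if ch not in VOWELS and ch != out[-1]:
            out.append(ch)
    return "".join(out)


def index_keys(name):
    """Map each hypothetical sub-field to the keys that would be indexed in it."""
    toks = tokens(name)
    return {
        "name.tokens": toks,                                     # exact tokens
        "name.sorted": [" ".join(sorted(toks))],                 # order-independent form
        "name.skeleton": [skeleton(t) for t in toks],            # spelling-tolerant keys
        "name.initials": ["".join(t[0] for t in sorted(toks))],  # initials
    }


indexed = index_keys("Abdul Jabbar, Karim")
queried = index_keys("Kareem AbdalJabar")
shared = set(indexed["name.skeleton"]) & set(queried["name.skeleton"])
print(shared)
```

Because this toy tokenizer only handles Latin letters, a name written in Arabic script yields no keys at all, which hints at why multilingual matching needs more than simple string heuristics.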

What happens at query time?
• Step #1: NameMapper generates analogous keys for
a custom Lucene query that finds good candidates
for re-scoring
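
The plug-in builds its candidate query inside Lucene; a loose query-DSL analogue of this step might look like the sketch below, which turns query-side keys into one `term` clause per key. The sub-field names and the `candidate_query` helper are assumptions for illustration only.

```python
def candidate_query(keys_by_field, min_match=1):
    """Build a recall-oriented bool query: a document matching any key is a candidate."""
    clauses = [
        {"term": {field: key}}
        for field, keys in sorted(keys_by_field.items())
        for key in keys
    ]
    return {
        "query": {
            "bool": {"should": clauses, "minimum_should_match": min_match}
        }
    }


# Hypothetical keys generated from the query name, mirroring the indexed sub-fields.
query_keys = {
    "name.skeleton": ["krm", "abdljbr"],
    "name.tokens": ["kareem", "abdaljabar"],
}
body = candidate_query(query_keys)
print(len(body["query"]["bool"]["should"]))
```

The goal at this stage is recall: the query only needs to pull plausible candidates into the result set, since precise scoring happens later.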

What else happens at query time?
• Step #2: Uses Rescore query to score names in the
best candidate documents and reorder accordingly
- Tuned for high precision name matching
- Computationally expensive

What does that function do?
• The 'name_score' function matches the query name
against the indexed name in every candidate
document and returns the similarity score

Rescore Params: Tradeoff Accuracy vs. Speed
• window_size
- Controls how many of the top
documents to rescore
- Tradeoff accuracy vs speed
• minScoreToCheck (added by us)
- Score threshold a top document must meet
to be rescored
- Tradeoff accuracy vs speed
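
The effect of these two parameters can be sketched in plain Python. Here `difflib.SequenceMatcher` stands in for the expensive name scorer, and the handling of documents that fall below the threshold (left unrescored, in first-pass order) is an assumption about the behavior rather than something stated on the slide.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Stand-in for an expensive name scorer; returns a value in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rescore(query_name, hits, window_size, min_score_to_check):
    """hits: (doc_id, name, first_pass_score) tuples, sorted by descending score.

    Returns (rescored, untouched). Only hits inside the window that also meet
    the score threshold pay for the expensive comparison.
    """
    window, rest = hits[:window_size], hits[window_size:]
    rescored, skipped = [], []
    for doc_id, name, score in window:
        if score >= min_score_to_check:
            rescored.append((doc_id, name, name_similarity(query_name, name)))
        else:
            skipped.append((doc_id, name, score))
    rescored.sort(key=lambda hit: hit[2], reverse=True)
    return rescored, skipped + rest


hits = [
    ("a", "Karim Abdul Jabbar", 5.0),
    ("b", "Kareem Abdul-Jabbar", 4.0),
    ("c", "Karen Abbot", 0.5),
    ("d", "Kareem AbdalJabar", 0.4),
]
rescored, untouched = rescore("Kareem AbdalJabar", hits, window_size=3, min_score_to_check=1.0)
print([hit[0] for hit in rescored], [hit[0] for hit in untouched])
```

In this toy run, document "d" matches the query exactly but sits outside the window, so it is never rescored. A larger window or a lower threshold would catch it at the cost of more expensive comparisons.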

Rescore Params - Integration w/Query
• rescore_query
- Calls the name_score function to get score
- Combine rescore_queries to query across multiple
fields
• query_weight
- Controls how much weight is given to main query
- Allows for queries on other non-name fields
• rescore_query_weight
- Controls how much weight is given to rescore query

Summary: How it works
• Central Problem
- Name Variety
- Name Ambiguity
• Custom field type
- Splits a single field into multiple fields covering different phenomena
- Supports multiple name fields in a document as well as multivalued fields
- Intercepts the query to inject a custom Lucene query
• Custom rescore function
- Rescores documents with algorithm specific to name matching
- Limits intense calculations to only top candidates
- Highly configurable

Resources
• Code
- https://github.com/cgmack/elastic_meetups
• This Presentation
-

Suggested Questions:
• What if the names are in unstructured text?
• What if the names are in other text fields?
• How did you implement multi-valued fields?
• How does it scale?
• How do you handle names not in English?
• How does this relate to the theme of Entity-Centric
Search?
• How do the plug-in’s scores relate to Elastic scores?
• How can I learn more?

6. What Makes Name Matching Hard?

7. Name Variety

8. Name Variety

9. Name Ambiguity

10. How Would You Solve It?

15. Demo

18. Plug-in Implementation

22. Plug-in Implementation
