Fuzzy Name Matching
with Elasticsearch
November 16th 2015
Chris Mack @cgmack
mack@basistech.com
Why Match Names?
1. Security
2. Fraud
3. Commerce
Quick survey: How many of you...
• Regularly develop Elastic applications?
• Develop Elastic applications that include names of…
...People?
...Places?
...Products?
...Organizations?
• Have names in languages besides English?
What Makes Name Matching Hard?
Name Variety
Name Variety
Name Ambiguity
How Would You Solve It?
Best Practice: field per variation type?
http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names
• Create a multi_field type with a field per possible variation
• Complex query against each field
• Generally gives high recall
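As a rough illustration of the field-per-variation approach, the sketch below builds a mapping and a matching query as plain Python dicts. The sub-field names (`exact`, `phonetic`, `ngram`) and the analyzer names `name_phonetic` and `name_ngram` are assumptions made for this example, not taken from the talk; the two custom analyzers would have to be defined in the index settings. The sketch uses the `fields` parameter on a `string` field, which replaced the older `multi_field` type in Elasticsearch 1.0.

```python
import json


def name_mapping(variations):
    """Build a mapping with one sub-field per variation type.

    `variations` maps a (hypothetical) sub-field name to an analyzer name.
    """
    return {
        "properties": {
            "name": {
                "type": "string",  # "text" in Elasticsearch 5 and later
                "fields": {
                    sub: {"type": "string", "analyzer": analyzer}
                    for sub, analyzer in variations.items()
                },
            }
        }
    }


def name_query(text, variations):
    """Query every sub-field so that any variation can contribute a match."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"name." + sub: text}} for sub in variations
                ]
            }
        }
    }


# Assumed variation types; "keyword" is a built-in analyzer, the others are not.
VARIATIONS = {
    "exact": "keyword",
    "phonetic": "name_phonetic",
    "ngram": "name_ngram",
}

mapping = name_mapping(VARIATIONS)
query = name_query("Jesus Lopez", VARIATIONS)
print(json.dumps(query, indent=2))
```

Every added variation type means another sub-field to maintain and another clause in the query, which is the bookkeeping the next slide asks a dedicated name field type to take over.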
Can’t a name field type do this?
• Manage all the subfields
• Contribute score that reflects phenomena
• Be part of queries using many field types
• Have multiple fields per document
• Have multiple values per field (coming soon)
But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobezDias, Chuy”
1) Reordered.
2) Nickname for first name.
3) Missing 2nd Name.
4) Two spelling differences.
5) Missing space.
Can We Do Better?
• Incorporate our proprietary name matching
• Provide similarity scores for name pairs
• Use Elasticsearch's Rescore query
• Allow higher-precision ranking and thresholding
• Provide multilingual name search
Demo
How does it work?
• The plugin contains a custom mapper that does all the work behind the scenes
What happens at index time?
• NameMapper indexes keys for different phenomena in separate (sub)fields
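To make "keys for different phenomena" concrete, here is a minimal sketch of what index-time key generation could look like. It is not the plugin's algorithm, which is proprietary and not shown in the talk. The sub-field names, the tiny nickname table, and the crude phonetic key are all assumptions for illustration.

```python
import re

# Tiny illustrative nickname table (an assumption, not real plugin data).
NICKNAMES = {"chuy": "jesus", "bill": "william"}


def crude_phonetic(token):
    """Very rough phonetic key: keep the first letter, drop later
    vowels and h/w/y, and collapse adjacent repeated consonants."""
    token = token.lower()
    key = token[0]
    for ch in token[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != key[-1]:
            key += ch
    return key


def name_keys(name):
    """Return one list of keys per (hypothetical) sub-field."""
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return {
        # Sorting makes every key list independent of token order.
        "tokens": sorted(tokens),
        "phonetic": sorted(crude_phonetic(t) for t in tokens),
        "initials": sorted(t[0] for t in tokens),
        "canonical": sorted(NICKNAMES.get(t, t) for t in tokens),
    }


keys = name_keys("Jesus Alfonso Lopez Diaz")
for subfield, values in keys.items():
    print(subfield, values)
```

Each key list would be indexed in its own sub-field, so a query can match on whichever phenomenon applies without the caller managing the sub-fields directly.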
Plug-in Implementation
What happens at query time?
• Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for rescoring
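The candidate-finding step can be approximated in the Elasticsearch query DSL. The plugin itself is described as building a custom Lucene query, so the sketch below only shows an equivalent shape: one `terms` clause per sub-field, combined in a `bool` query. The sub-field names and the example keys are assumptions for illustration.

```python
def candidate_query(field, keys, min_match=1):
    """Build a recall-oriented bool query from per-sub-field keys.

    `keys` maps a (hypothetical) sub-field name to the keys generated
    for the query name; empty key lists are skipped.
    """
    should = [
        {"terms": {f"{field}.{sub}": values}}
        for sub, values in keys.items()
        if values
    ]
    return {"bool": {"should": should, "minimum_should_match": min_match}}


# Keys as they might be generated for the query name "LobezDias, Chuy".
query_keys = {
    "tokens": ["chuy", "lobezdias"],
    "initials": ["c", "l"],
    "canonical": ["jesus", "lobezdias"],
}

body = {"query": candidate_query("name", query_keys)}
```

The goal at this stage is recall: the query only has to surface plausible candidates cheaply, because precise scoring is left to the rescore step.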
What else happens at query time?
• Step #2: Uses Rescore query to score names in the best candidate documents and reorder them accordingly
- Tuned for high-precision name matching
- Computationally expensive
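A request using Elasticsearch's rescore feature could be shaped roughly like this. The outer structure (`rescore`, `window_size`, `query`, `rescore_query`) is standard Elasticsearch; the `name_score` function is named in the talk, but its parameter names (`field`, `query_name`) and its placement inside `function_score` are assumptions here, so check the plugin's own documentation for the actual syntax.

```python
import json


def rescore_request(field, query_name, window_size=50):
    """Build a search body: cheap first-pass query plus expensive rescoring."""
    return {
        # First pass: ordinary query that gathers candidates.
        "query": {"match": {field: query_name}},
        # Second pass: only the top `window_size` hits per shard are rescored.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # Parameter names below are hypothetical.
                        "name_score": {
                            "field": field,
                            "query_name": query_name,
                        }
                    }
                }
            },
        },
    }


body = rescore_request("name", "Chuy LobezDias", window_size=25)
print(json.dumps(body, indent=2))
```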
What does that function do?
• The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score
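The actual matching algorithm is proprietary and not shown in the talk. As a stand-in, the sketch below scores a name pair with Python's standard `difflib`: each query token is matched to its most similar indexed token, and the best-match ratios are averaged. It only illustrates the contract of such a function (two names in, one similarity score between 0 and 1 out).

```python
from difflib import SequenceMatcher


def token_similarity(a, b):
    """Character-level similarity of two tokens, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def name_similarity(query_name, indexed_name):
    """Stand-in pairwise scorer (NOT the plugin's algorithm).

    Order-insensitive: each query token is paired with its best match
    among the indexed tokens, and the best ratios are averaged.
    """
    q = query_name.lower().replace(",", " ").split()
    d = indexed_name.lower().replace(",", " ").split()
    if not q or not d:
        return 0.0
    best = [max(token_similarity(t, u) for u in d) for t in q]
    return sum(best) / len(best)


print(name_similarity("Lopez Diaz", "Diaz Lopez"))
print(name_similarity("LobezDias, Chuy", "Jesus Alfonso Lopez Diaz"))
```

This stand-in tolerates reordering and small spelling differences, but it knows nothing about nicknames or missing spaces, which is why the "LobezDias, Chuy" pair from the earlier slide is a hard case for simple approaches.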
Plug-in Implementation
Rescore Params: Tradeoff Accuracy vs. Speed
• window_size
- Controls how many of the top documents are rescored
- Trades accuracy for speed
• minScoreToCheck (added by us)
- Score threshold a top document must meet to be rescored
- Trades accuracy for speed
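The effect of the two parameters can be simulated outside Elasticsearch. The sketch below assumes one plausible reading of `minScoreToCheck`: a document inside the window is passed to the expensive scorer only if its first-pass score reaches the threshold, and otherwise keeps its original position after the rescored documents. The function name, the example scores, and that interpretation are assumptions, not the plugin's documented behavior.

```python
def rescore_top(hits, scorer, window_size, min_score_to_check=0.0):
    """Simulate windowed rescoring.

    `hits` is a list of (doc_id, first_pass_score), best first.
    Returns the reordered doc ids and the number of expensive calls made.
    """
    window, rest = hits[:window_size], hits[window_size:]
    checked, skipped = [], []
    for doc_id, score in window:
        if score >= min_score_to_check:
            checked.append((doc_id, scorer(doc_id)))  # expensive call
        else:
            skipped.append(doc_id)
    checked.sort(key=lambda hit: hit[1], reverse=True)
    order = [doc_id for doc_id, _ in checked] + skipped
    order += [doc_id for doc_id, _ in rest]
    return order, len(checked)


# Made-up first-pass scores and name-similarity scores.
hits = [("a", 9.0), ("b", 7.5), ("c", 3.0), ("d", 2.0)]
name_scores = {"a": 0.62, "b": 0.91, "c": 0.88, "d": 0.95}

ranked, calls = rescore_top(
    hits, name_scores.get, window_size=3, min_score_to_check=5.0
)
print(ranked, calls)  # ['b', 'a', 'c', 'd'] 2
```

In this made-up data, document `d` has the highest name score but is never rescored because it falls outside the window. That is the accuracy cost of a small window; a larger window or a lower threshold would find it at the price of more expensive scorer calls.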
Rescore Params - Integration w/Query
• rescore_query
- Calls the name_score function to get score
- Combine rescore_queries to query across multiple fields
• query_weight
- Controls how much weight is given to main query
- Allows for queries on other non-name fields
• rescore_query_weight
- Controls how much weight is given to rescore query
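With Elasticsearch's default rescore `score_mode` of `total`, the final score of a rescored document is the weighted sum of the two scores. The short sketch below works through that arithmetic; the example scores and weights are made up.

```python
def combined_score(query_score, rescore_score,
                   query_weight=1.0, rescore_query_weight=1.0):
    """Final score under score_mode "total" (the Elasticsearch default)."""
    return query_weight * query_score + rescore_query_weight * rescore_score


# Rank purely by name similarity: ignore the first-pass score.
only_name = combined_score(4.0, 0.9,
                           query_weight=0.0, rescore_query_weight=1.0)

# Blend both: the main query (possibly over non-name fields) still counts.
blended = combined_score(4.0, 0.5,
                         query_weight=0.5, rescore_query_weight=2.0)

print(only_name, blended)  # 0.9 3.0
```

Setting `query_weight` to 0 makes the ranking depend only on the name score, which suits thresholding on similarity. A non-zero `query_weight` lets clauses on other, non-name fields continue to influence the order.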
Summary: How it works
• Central Problem
- Name Variety
- Name Ambiguity
• Custom field type
- Splits a single field into multiple fields covering different phenomena
- Supports multiple name fields in a document as well as multivalued fields
- Intercepts the query to inject a custom Lucene query
• Custom rescore function
- Rescores documents with algorithm specific to name matching
- Limits intense calculations to only top candidates
- Highly configurable
Resources
• Code
- https://github.com/cgmack/elastic_meetups
• This Presentation
-
Suggested Questions:
• What if names are in unstructured text?
• What if the names are in other text fields?
• How did you implement multi-valued fields?
• How does it scale?
• How do you handle names not in English?
• How does this relate to the theme of Entity-Centric Search?
• How do the plug-in’s scores relate to Elastic scores?
• How can I learn more?