Simple fuzzy name matching in elasticsearch

Normalization is crucial to high quality search results -- who wants irrelevant variations between queries and documents leading to missed hits (e.g., “celebrity” v. “celebrities”)? Normalizing dictionary words works, but what if your application focuses on names? Whether you’re tackling log analysis, e-commerce, watch list screening or other applications, names are often the key. Can you find “Abdul Jabbar, Karim” if you search for “Kareem AbdalJabar” or “كريم عبد الجبار”?

Applications using Elasticsearch provide some fuzziness by mixing its built-in edit-distance matching and phonetic analysis with more generic analyzers and filters. We’ve tried to go beyond that to provide both better matching and a simpler integration. We use a custom Mapper and Score Function so that linguistic nuances can be handled behind the scenes. We’ll talk about how we built this sort of plug-in for Rosette, its customization, and its connection to the broader trend of entity-centric search.

Published in: Technology

Simple fuzzy name matching in elasticsearch

  1. Simple Fuzzy Name Matching in Elasticsearch
     June 18, 2015
     Brian Sawyer
     Engineering Manager
     bsawyer@basistech.com
  2. Quick survey: How many of us...
     ● Regularly develop Elastic applications?
     ● Develop Elastic applications that include names of…
       ○ ...People?
       ○ ...Places?
       ○ ...Products?
       ○ ...Organizations?
       ○ …(other entity types)?
     ● Have names in languages besides English?
     ● Want to have better name search?
     ● Are Elasticsearch or plugin developers?
  3. Motivating Questions...
     ● How could a border officer know whether you’re on a terrorist watch list?
     ● How does your bank know if you’re wiring money to a drug lord?
     ● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?
     ● How can a system search for mentions of people across news articles?
  4. Answer...
     Name Matching (plus more)
  5. What kinds of name variation?
  6. Real life example
     David K. Murgatroyd
     VP of Engineering
     Boarding Pass
  7. Current Best Practice?
     ● multi_field type with a field per possible variation (http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names)
         "mappings": {
           ...
           "type": "multi_field",
           "fields": {
             "pty_surename": { "type": "string", "analyzer": "simple" },
             "metaphone": { "type": "string", "analyzer": "metaphone" },
             "porter": { "type": "string", "analyzer": "porter" }
             …
     ● Complex query against each field
     ● Generally gives high recall
  8. Can’t a name field type do this?
     ● Manage all the subfields
     ● Contribute score that reflects phenomena
     ● Be part of queries using many field types
     ● Have multiple fields per document
     ● Have multiple values per field (coming soon)
  9. But what if variations co-occur?
     “Jesus Alfonso Lopez Diaz” v. “LobEzDiaS, Chuy”
     ● Reordered
     ● Missing token
     ● Two spelling differences
     ● Nickname for first name
     ● Missing space
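
To make the co-occurrence problem concrete, here is a minimal Java sketch that handles the five listed differences together for this one pair. It is not the Rosette algorithm, which the slides do not show. It swaps in plain Levenshtein distance, a hypothetical one-entry nickname table, and a best-match average over tokens, and every class and method name in it is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CoOccurringVariationSketch {
    // Hypothetical nickname table with a single entry; a real system needs a large resource.
    static final Map<String, String> NICKNAMES = Collections.singletonMap("chuy", "jesus");

    // Classic dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Edit distance normalized to [0, 1], where 1 means identical strings.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    // Lowercase, split on anything that is not a letter, then expand nicknames.
    static List<String> tokens(String name) {
        List<String> out = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(NICKNAMES.getOrDefault(t, t));
        }
        return out;
    }

    // Each token plus each concatenation of two adjacent tokens (covers a missing space).
    static List<String> withAdjacentPairs(List<String> toks) {
        List<String> out = new ArrayList<>(toks);
        for (int i = 0; i + 1 < toks.size(); i++) out.add(toks.get(i) + toks.get(i + 1));
        return out;
    }

    // Average, over the tokens of the shorter name, of the best match found in the longer name.
    // Ignoring token order covers reordering; using the shorter side covers a missing token.
    static double nameSimilarity(String x, String y) {
        List<String> a = tokens(x), b = tokens(y);
        List<String> shorter = a.size() <= b.size() ? a : b;
        List<String> longer = withAdjacentPairs(shorter == a ? b : a);
        if (shorter.isEmpty()) return 0.0;
        double total = 0;
        for (String s : shorter) {
            double best = 0;
            for (String l : longer) best = Math.max(best, similarity(s, l));
            total += best;
        }
        return total / shorter.size();
    }

    public static void main(String[] args) {
        String indexed = "Jesus Alfonso Lopez Diaz";
        String query = "LobEzDiaS, Chuy";
        System.out.println("whole-string similarity: "
                + similarity(indexed.toLowerCase(Locale.ROOT), query.toLowerCase(Locale.ROOT)));
        System.out.println("token-aware similarity:  " + nameSimilarity(indexed, query));
    }
}
```

Even this toy needs a nickname resource, a concatenation rule, and an alignment strategy working together for a single pair, which suggests why one subfield per phenomenon struggles when the phenomena overlap.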
  10. Can we do better?
     ● Incorporates our proprietary name matching technology
     ● Provides similarity scores to name pairs
     ● Uses Elasticsearch's Rescore query
     ● Allows for higher precision ranking and thresholding
     ● Multi-lingual name search
  11. Demo
  12. How could you use such a Field?
     ● Plugin contains custom mapper which does all the work behind the scenes
         PUT /ofac/ofac/_mapping
         {
           "ofac" : {
             "properties" : {
               "name" : { "type" : "rni_name" },
               "aka" : { "type" : "rni_name" }
             }
           }
         }
  13. What happens at index time?
     ● NameMapper indexes keys for different phenomena in separate (sub) fields
         @Override
         public void parse(ParseContext context) throws IOException {
             Name name = NameBuilder.data(nameString).build();
             // Generate keys for name
             Collection<FieldSpec> fields = helper.deriveFieldsForName(name);
             // Parse each key with the appropriate Mapper
             for (FieldSpec field : fields) {
                 Mapper mapper = keyMappers.get(field.getField().fieldName());
                 context = context.createExternalValueContext(field.getStringValue());
                 mapper.parse(context);
             }
         }
  14. Indexing
     User Doc → Plug-in Implementation → Index
     User Doc:
         { name: "Robert Smith", dob: "1987/02/13" }
     Index:
         { name: "Robert Smith", name.key1: …, name.key2: …, name.key3: …, dob: "1987/02/13" }
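
The expansion of one name field into several key subfields can be sketched in a few lines of Java. The slides do not say what the plug-in's keys contain, so the three keys below (sorted tokens, a crude consonant skeleton, and initials) are purely hypothetical stand-ins, as are the class and method names. Only the `name.key1`, `name.key2`, `name.key3` naming follows the slide.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class NameKeySketch {
    // Split a name into lowercase, letter-only tokens.
    static String[] tokens(String name) {
        return Arrays.stream(name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .toArray(String[]::new);
    }

    // Hypothetical key 1: sorted tokens, so "Smith, Robert" and "Robert Smith" collide.
    static String sortedTokenKey(String name) {
        String[] t = tokens(name);
        Arrays.sort(t);
        return String.join(" ", t);
    }

    // Hypothetical key 2: keep each token's first letter, drop later vowels (a very rough spelling-tolerant key).
    static String consonantKey(String name) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens(name)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.charAt(0)).append(t.substring(1).replaceAll("[aeiouy]", ""));
        }
        return sb.toString();
    }

    // Hypothetical key 3: initials of every token.
    static String initialsKey(String name) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens(name)) sb.append(t.charAt(0));
        return sb.toString();
    }

    // The user's single field becomes the original value plus one subfield per key.
    static Map<String, String> deriveFields(String field, String name) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put(field, name);
        out.put(field + ".key1", sortedTokenKey(name));
        out.put(field + ".key2", consonantKey(name));
        out.put(field + ".key3", initialsKey(name));
        return out;
    }

    public static void main(String[] args) {
        deriveFields("name", "Robert Smith").forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

The same key functions would be applied to the query name, so that a cheap term lookup on the subfields can find candidates before any expensive comparison runs.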
  15. What happens at query time?
     ● Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for re-scoring
         @Override
         public Query termQuery(Object value, @Nullable QueryParseContext context) {
             // Parse name string
             Name name = NameBuilder.data(value.toString()).build();
             QuerySpec spec = helper.buildQuerySpec(new NameIndexQuery(name));
             // Build Lucene query
             Query query = spec.accept(new ESQueryVisitor(names.indexName() + "."));
             return query;
         }
  16. What else happens at query time?
     ● Step #2: Uses a Rescore query to score names in the best candidate documents and reorder accordingly
       ○ Tuned for high precision name matching
       ○ Computationally expensive
         "rescore" : {
           "query" : {
             "rescore_query" : {
               "function_score" : {
                 "name_score" : {
                   "field" : "name",
                   "query_name" : "LobEzDiaS, Chuy"
                 }
         ...
  17. What does that function do?
     ● The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score
         @Override
         public double score(int docId, float subQueryScore) {
             // Create a scorer for the query name
             CachedScorer cs = createCachedScorer(queryName);
             // Retrieve name data from doc values
             nameByteData.setDocument(docId);
             Name indexName = bytesToName(nameByteData.valueAt(i).bytes);
             // Score the query against the indexed name in this document
             return cs.score(indexName);
         }
  18. Main Query and Rescore Query
     Indexing: User Doc → Plug-in Implementation → Index
       User Doc: { name: "Robert Smith", dob: "1987/02/13" }
       Index: { name: "Robert Smith", name.Key1: …, name.Key2: …, name.Key3: …, dob: "1987/02/13" }
     Main Query: User Query → Plug-in Implementation
       User Query: { match : { name: "Bob Smitty" } }
       Plug-in Implementation: bool: name.Key1:... name.Key2:... name.Key3:...
     Rescore Query:
       Rescore: name_score : { field : "name", name : "Bob Smitty" }
       Result: name: "Robert Smith", dob: 2/13/1987, score: .79
  19. Rescore Params - Speed v. Accuracy
     ● window_size
       ○ Controls how many of the top documents to rescore
       ○ Tradeoff accuracy vs speed
     ● minScoreToCheck - (Added by Us)
       ○ Score threshold top doc must meet to be rescored
       ○ Tradeoff accuracy vs speed
  20. Trading Off Accuracy for Speed
     Query → High Recall Query (Elastic) → High Recall Results → Subset (Total < window size & Score > minimum Score Threshold) → Rescoring (for High Precision) → Scored Results
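
The two-stage flow above can be sketched as a small Java routine: a cheap first pass supplies ranked hits, and only hits inside the window that also clear the threshold reach the expensive scorer. This is an illustration under stated assumptions, not the plug-in's code. The `Hit` type, the `rescore` method, the `>=` comparison, and the choice to list unrescored hits after the rescored ones in their original order are all hypothetical. The parameter names `windowSize` and `minScoreToCheck` mirror the slide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class RescoreWindowSketch {
    // A document name together with its current score (hypothetical type).
    static final class Hit {
        final String name;
        final double score;
        Hit(String name, double score) { this.name = name; this.score = score; }
    }

    // firstPass must already be sorted by descending first-pass score.
    static List<Hit> rescore(List<Hit> firstPass, int windowSize, double minScoreToCheck,
                             ToDoubleFunction<String> expensiveScorer) {
        List<Hit> rescored = new ArrayList<>();
        List<Hit> untouched = new ArrayList<>();
        for (int i = 0; i < firstPass.size(); i++) {
            Hit h = firstPass.get(i);
            // Only the top windowSize hits that meet the threshold pay for the expensive scorer.
            if (i < windowSize && h.score >= minScoreToCheck) {
                rescored.add(new Hit(h.name, expensiveScorer.applyAsDouble(h.name)));
            } else {
                untouched.add(h);
            }
        }
        // Reorder the rescored hits by their new score, best first.
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        // Assumption: hits that were not rescored keep their original order, after the rescored ones.
        rescored.addAll(untouched);
        return rescored;
    }
}
```

A larger `windowSize` or a lower `minScoreToCheck` sends more candidates to the expensive scorer, which is the accuracy side of the tradeoff. Tightening either one is the speed side.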
  21. Rescore Params - Integration w/Query
     ● rescore_query
       ○ Calls the name_score function to get score
       ○ Combine rescore_queries to query across multiple fields
     ● query_weight
       ○ Controls how much weight is given to main query
       ○ Allows user to include queries on other non-name fields
     ● rescore_query_weight
       ○ Controls how much weight is given to rescore query
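
As a worked illustration of how the two weights interact, the sketch below assumes Elasticsearch's default rescore `score_mode` of `total`, where the weighted main-query score and the weighted rescore-query score are added. The class and method names are hypothetical, and the slides do not say which score mode the plug-in's examples use.

```java
final class RescoreWeightSketch {
    // Combined score under the assumed "total" score mode:
    // query_weight * main query score + rescore_query_weight * rescore query score.
    static double combinedScore(double queryScore, double queryWeight,
                                double rescoreScore, double rescoreQueryWeight) {
        return queryWeight * queryScore + rescoreQueryWeight * rescoreScore;
    }

    public static void main(String[] args) {
        // With query_weight = 0 the final ranking depends on the name score alone.
        System.out.println(combinedScore(2.0, 0.0, 0.79, 1.0));
        // A non-zero query_weight lets queries on non-name fields (such as dob) contribute as well.
        System.out.println(combinedScore(2.0, 0.5, 0.79, 1.0));
    }
}
```

Setting `query_weight` to 0 makes the name similarity the whole score, which suits thresholding on the similarity value. Raising it blends in evidence from the rest of the query.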
  22. What Challenges Were There?
     ● Design based on similar Solr plugin
     ● 1-2 months solo development time
     ● Nice plugin infrastructure
     ● Missing some useful javadocs/comments
     ● No (official) plugin development guide
     ● Used other plugin implementations as guides
     https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#_plugins
  23. Summary: How it works
     ● Custom field type mapping
       ○ Splits a single field into multiple fields covering different phenomena
       ○ Supports multiple name fields in a document
       ○ Intercepts the query to inject a custom Lucene query
     ● Custom rescore function
       ○ Rescores documents with algorithm specific to name matching
       ○ Limits intense calculations to only top candidates
       ○ Highly configurable
  24. Simple Fuzzy Name Matching in Elasticsearch
     June 18, 2015
     Brian Sawyer
     Engineering Manager
     bsawyer@basistech.com
  25. Suggested Questions:
     ● What if the names are in other text fields?
     ● How did you implement multi-valued fields?
     ● How does it scale?
     ● How do you handle names not in English?
     ● How does this relate to the theme of Entity-Centric Search?
     ● How do the plug-in’s scores relate to Solr scores?
