October 13-16, 2016 • Austin, TX
Simple Fuzzy Name Matching in Solr
Chris Mack
Director Customer Engineering
Basis Technology
Why Match Names?
Just a Name….
...Right?
1.  Security
2.  Fraud
3.  Commerce
Quick survey: How many of you...
•  Regularly develop Solr applications?
•  Develop Solr applications that include names of…
...People?
...Places?
...Products?
...Organizations?
•  Have names in languages besides English?
What Makes Name Matching Hard?
Name Variety
Name Variety
Name Ambiguity
How Would You Solve It?
Best Practice: field per variation type?
Idea: Create a Custom Solr Field
•  Contribute score that reflects phenomena.
•  Be part of queries using many field types.
•  Have multiple fields per document.
•  Have multiple values per field.
But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobezDias, Chuy”
1) Reordered.
2) Nickname for first name.
3) Missing 2nd Name.
4) Two spelling differences.
5) Missing space.
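To make the problem concrete, here is a minimal sketch in plain Java of why a single generic technique struggles when these variations stack up. It uses ordinary Levenshtein edit distance, which is only a stand-in and not the matching algorithm discussed in this talk; the class and method names are hypothetical. Individual tokens such as "lopez" and "lobez" are one edit apart, yet the whole-string similarity of the two names comes out low.

```java
// Illustrative only: plain Levenshtein distance, NOT the plugin's algorithm.
final class CoOccurringVariations {

    // Classic two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance normalized to [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lopez", "lobez"));
        System.out.println(similarity("jesus alfonso lopez diaz", "lobezdias, chuy"));
    }
}
```

Token-level spelling differences are cheap to absorb, but reordering, the nickname, the missing name, and the missing space together push the whole-string score below any useful threshold.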
Can We Do Better?
•  Incorporate our proprietary name matching
•  Provide similarity scores to name pairs
•  Use Solr’s Rerank feature
•  Allows for higher precision ranking and thresholding
•  Provides multi-lingual name search
Simple to Configure
•  Plugin contains a custom field type that does all the work behind the scenes
•  Simple addition to schema.xml to include the new field type
<fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>
<field name="name" type="rni_name" indexed="true" stored="true" multiValued="false"/>
<field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>
Plug-in Implementation
What happens at query time?
•  Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking
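public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
    // Parse the raw query string into a structured name, using the request parameters.
    Name name = parseNameString(externalVal, parser.getParams());
    // Build the candidate-finding query spec from the parsed name.
    QuerySpec querySpec = buildQuery(name);
    // Translate the spec into a Lucene query against this field.
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}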
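The slides do not say what the generated keys look like. Purely as an illustration of the idea of a high-recall first pass, the sketch below derives order-independent keys (a lowercased token key and an initial key per name token) and joins them into an OR query. The key scheme, the `tok_`/`ini_` prefixes, the `name_keys` field, and the class itself are all hypothetical and are not taken from the plugin.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical key scheme for illustration; the real plugin's keys are not shown in the slides.
final class CandidateKeys {

    // One token key and one initial key per name token, independent of token order.
    static Set<String> keysFor(String name) {
        Set<String> keys = new LinkedHashSet<>();
        for (String token : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token for leading separators
            }
            keys.add("tok_" + token);
            keys.add("ini_" + token.charAt(0));
        }
        return keys;
    }

    // Any shared key makes a document a candidate, so the keys are OR'ed together.
    static String toOrQuery(String field, Set<String> keys) {
        return field + ":(" + String.join(" OR ", keys) + ")";
    }

    public static void main(String[] args) {
        System.out.println(toOrQuery("name_keys", keysFor("Lopez Diaz, Jesus")));
    }
}
```

Because the keys form a set, "Lopez Diaz, Jesus" and "Jesus Lopez Diaz" produce the same keys, so reordering alone never removes a document from the candidate list. Precision is left to the re-ranking step.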
What else happens at query time?
•  Step #2: Uses Solr’s Rerank feature to rescore names in the top documents and reorder them accordingly
- Tuned for high precision
- Simple addition to solrconfig.xml
<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
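With both parsers registered, a request would presumably pair a normal field query with a rerank query. The sketch below only assembles the request parameters as strings. The `rq` parameter, the `reRankQuery=$...` indirection, and `{!func}` come from stock Solr, and `rniRerank`, `rniMatch`, and `reRankDocs` come from these slides. That the plugin's parser accepts the same local-parameter names as Solr's stock rerank parser, and that `rniMatch` takes a field and a quoted name in that order, are assumptions; the class and the `rrq` parameter name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assembles request parameters only; argument order of rniMatch(...) is an assumption.
final class RniRerankParams {

    static Map<String, String> build(String field, String queryName, int reRankDocs) {
        Map<String, String> params = new LinkedHashMap<>();
        // First pass: a field query, which the custom field type turns into its candidate query.
        params.put("q", field + ":\"" + queryName + "\"");
        // Second pass: rerank the top documents using the query referenced by $rrq.
        params.put("rq", "{!rniRerank reRankQuery=$rrq reRankDocs=" + reRankDocs + "}");
        // Function query that scores the stored name against the query name.
        params.put("rrq", "{!func}rniMatch(" + field + ", \"" + queryName + "\")");
        return params;
    }

    public static void main(String[] args) {
        build("name", "Chuy Lobez", 100)
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The values are not URL-encoded and the name is not escaped, so a real client would still need to handle quotes inside names and encode the parameters before sending them.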
Plug-in Implementation
Ability to Trade Off Accuracy vs. Speed
•  reRankScoreThreshold - Score threshold a top document must meet to be rescored.
•  reRankDocs - Controls how many of the top documents to rescore.
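How the two parameters bound the expensive work can be sketched as follows. This is a simplified model and not the plugin's code: it assumes the threshold is compared against the first-pass score, that rescored documents are ordered by their new score ahead of documents that were not rescored, and that the name scorer is supplied by the caller. The `Hit` type and `rerank` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Simplified model of threshold-limited reranking; ordering policy is an assumption.
final class RerankSketch {

    static final class Hit {
        final String name;
        final double score;

        Hit(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    // firstPass must already be sorted by descending first-pass score.
    static List<Hit> rerank(List<Hit> firstPass, int reRankDocs,
                            double reRankScoreThreshold,
                            ToDoubleFunction<String> nameScorer) {
        List<Hit> rescored = new ArrayList<>();
        List<Hit> untouched = new ArrayList<>();
        for (int i = 0; i < firstPass.size(); i++) {
            Hit hit = firstPass.get(i);
            // Only the top reRankDocs hits that meet the threshold pay for rescoring.
            if (i < reRankDocs && hit.score >= reRankScoreThreshold) {
                rescored.add(new Hit(hit.name, nameScorer.applyAsDouble(hit.name)));
            } else {
                untouched.add(hit);
            }
        }
        // Stable sort by the new score, highest first.
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        rescored.addAll(untouched);
        return rescored;
    }

    public static void main(String[] args) {
        List<Hit> firstPass = Arrays.asList(
            new Hit("a", 3.0), new Hit("b", 2.0), new Hit("c", 1.0));
        for (Hit hit : rerank(firstPass, 2, 0.0, n -> n.equals("b") ? 0.9 : 0.1)) {
            System.out.println(hit.name + " " + hit.score);
        }
    }
}
```

Raising reRankDocs or lowering reRankScoreThreshold sends more documents through the scorer, which costs time but gives the name-specific score more chances to promote a good match.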
Summary: How it works
•  Custom field type
- Splits a single field into multiple fields covering different phenomena
- Supports multiple name fields in a document as well as multivalued fields
- Intercepts the query to inject a custom Lucene query
•  Custom rerank function
- Rescores documents with algorithm specific to name matching
- Limits intense calculations to only top candidates
- Highly configurable
Suggested Questions:
•  What if names are in unstructured text?
•  What if the names are in other text fields?
•  How did you implement multi-valued fields?
•  How does it scale?
•  How do you handle names not in English?
•  How does this relate to the theme of Entity-Centric Search?
•  How do the plug-in’s scores relate to Solr scores?
•  How can I learn more?