Simple Fuzzy Name Matching in Solr
March 5, 2015
David Murgatroyd & Brian Sawyer (VP Engineering & Engineering Manager)

Learn more at https://www.rosette.com/fuzzy-search-names-in-elasticsearch/

We all know normalization is crucial to delivering high-quality search results. We don’t want uninteresting variations between the query and the document to lead to missed hits (e.g., “celebrity” v. “celebrities”). Normalization of dictionary words is well understood, but what if your application focuses on names? Whether you’re tackling patent examination, sports records, e-commerce, watchlist screening or many other topics, names are often the key. Can your users find “Abdul Jabbar, Karim” if they search for “Kareem AbdalJabar” or “كريم عبد الجبار”? Solr application architects have attempted to address this through custom integration of nickname lists, edit distance, case normalization, phonetic encoding and n-grams (see example #1 or example #2), but doing so requires significant effort and may not address all desired variations. A simpler approach is to use a Solr field type for names that handles these linguistic nuances behind the scenes. We’ll talk about how we built this sort of field type via a Solr plug-in for the Rosette Name Indexer. We’ll also discuss examples of use cases this has enabled, how it can be tuned if necessary, and how it connects to the broader trend of entity-centric search.
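
To make the hand-rolled approach concrete, here is a minimal Python sketch that combines case normalization and character n-grams, two of the techniques named above, with token sorting to ignore word order (an addition of this sketch). It is an illustration only and not the Rosette plug-in's method; the function names `normalize`, `ngrams`, and `similarity` are invented for this example.

```python
import unicodedata


def normalize(name):
    """Case-fold, strip accents and punctuation, and sort tokens so word order is ignored."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " "
        for ch in decomposed
        if not unicodedata.combining(ch)  # drop accent marks left by NFKD
    )
    return " ".join(sorted(cleaned.split()))


def ngrams(text, n=2):
    """Set of character n-grams, padded with spaces so word boundaries count."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Jaccard overlap of character bigrams after normalization (0.0 to 1.0)."""
    grams_a = ngrams(normalize(a))
    grams_b = ngrams(normalize(b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(similarity("Robert Smith", "Smith, Robert"))
    print(similarity("Abdul Jabbar, Karim", "Kareem AbdalJabar"))
    print(similarity("Abdul Jabbar, Karim", "كريم عبد الجبار"))
```

With these definitions, a reordered name such as "Smith, Robert" scores 1.0 against "Robert Smith", the two Latin spellings of the basketball player's name get a partial score, and the Latin and Arabic spellings share no bigrams at all and score 0.0. Cross-script matching is one kind of variation that this style of integration leaves uncovered.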

Simple Fuzzy Name Matching in Solr

  1. Simple Fuzzy Name Matching in Solr. March 5, 2015. David Murgatroyd & Brian Sawyer (VP Engineering & Engineering Manager)
  2. Quick survey: How many of us...
     ● Have ever indexed something into Solr?
     ● Have seen a Solr Admin interface?
     ● Regularly develop Solr applications?
     ● Develop Solr applications that include names?
     ● Have wondered how to fuzzy search those names?
  3. Motivating Questions...
     ● How could CBP know whether you’re on a terrorist watch list?
     ● How does your bank know if you’re wiring money to a drug lord?
     ● How does Airbnb know that’s really your driver’s license?
  4. Answer... Name Matching (plus more)
  5. Review of Basic Solr
     [Diagram. Indexing: Doc (name:"Robert Smith", dob:2/13/1987) into Index. Query: q=name:"Bob Smitty" returns name:"Robert Smith", dob:2/13/1987, score : .79]
  6. Where does Solr fit?
     [Diagram. Indexing: Terrorist Doc into Index. Query: Air Traveler Name returns Terrorist, score : .79]
  7. Where does Solr fit?
     [Diagram. Indexing: Sanctioned Drug Lord Doc into Index. Query: Wire Transfer Beneficiary returns Drug Lord, score : .79]
  8. Where does Solr fit?
     [Diagram. Name on your account; Name off your license; score : .79]
  9. What kinds of name variation?
  10. Best Practice: field per variation type?
  11. But what if variations co-occur? “Jesus A. Lopez Diaz” v. “LobezDeaz, Chuy”
     ● Reordered.
     ● Missing initial.
     ● Two spelling differences.
     ● Nickname for first name.
     ● Missing space.
  12. Can’t a name field type do this? Like…
     ● Contribute score that reflects phenomena.
     ● Be part of queries using many field types.
     ● Have multiple fields per document.
     ● Have multiple values per field.
  13. Demo
  14. How could you use such a Field?
     ● Plugin contains custom field type which does all the work behind the scenes
     ● Simple change to schema.xml to include new fieldType
       <fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>
       <field name="primaryName" type="rni_name" indexed="true" stored="true" multiValued="false"/>
       <field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>
  15. What happens at index time?
     ● NameField indexes keys for different phenomena in separate (sub) fields
  16. Indexing
     [Diagram. User Doc (name:"Robert Smith", dob:2/13/1987) passes through the Plug-in Implementation into the Index as name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987]
  17. What happens at query time?
     ● Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking
  18. What else happens at query time?
     ● Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
       ○ &rq={!rniRerank reRankQuery=$rrq} &rrq={!func} rniMatch(fieldName, "John Doe")
       ○ Tuned for high precision
       ○ Requires small addition to solrconfig.xml
         <queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
         <valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
  19. Indexing, Main Query, Rerank Query
     [Diagram. Indexing: User Doc (name:"Robert Smith", dob:2/13/1987) passes through the Plug-in Implementation into the Index as name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987. Main Query: User Query q=name:"Bob Smitty" becomes booleanQuery: name_Key1:... name_Key2:... name_Key3:... Rerank Query: Reranker rniMatch(name, "Bob Smitty") returns name:"Robert Smith", dob:2/13/1987, score : .79]
  20. Trading Off Accuracy for Speed
     [Diagram. High Recall Query (Solr); High Recall Results; Subset with Score > reRankScoreThreshold & Total < reRankDocs; ReRank Rescoring Query; Scored Results]
  21. Rerank Params - Speed v. Accuracy
     ● reRankScoreThreshold - Added by Us
       ○ Score threshold top doc must meet to be rescored
       ○ Tradeoff accuracy vs speed
     ● reRankDocs
       ○ Controls how many of the top documents to rescore
       ○ Tradeoff accuracy vs speed
  22. Rerank Params - Integration w/Query
     ● reRankQuery
       ○ Calls the NameMatch function to get score
       ○ Can query multiple names or other fields
     ● reRankWeight
       ○ Controls how much weight is given to name score vs main query
       ○ Allows user to include queries on other non-name fields
     ● reRankMode - Added by Us
       ○ Controls how the rerank score should be combined with main query score
       ○ Currently 'add' or 'replace'
  23. Summary: How it works
     ● Custom field type
       ○ Splits a single field into multiple fields covering different phenomena
       ○ Supports multiple name fields in a document as well as multivalued fields
       ○ Intercepts the query to inject a custom Lucene query
     ● Custom rerank function
       ○ Rescores documents with algorithm specific to name matching
       ○ Limits costly calculations to only top candidates
       ○ Highly configurable
  24. Suggested Questions:
     ● Thank David Smiley for helping? (Yes!)
     ● What if the names are in other text fields?
     ● What about support in Solr 5.0?
     ● How did you implement multi-valued fields?
     ● What about support in ElasticSearch?
     ● How does it scale?
     ● How do you handle names not in English?
     ● How does this relate to the theme of Entity-Centric Search?
     ● How do plug-in’s scores relate to Solr scores?
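
The query-time flow described in slides 18 to 22 can be sketched in Python. The first function builds the request parameters using only the `rq` and `rrq` strings shown on slide 18. The second is a toy stand-in for the rerank stage: the names `rerank`, `name_score`, `docs`, `threshold`, `weight`, and `mode` are hypothetical, chosen to mirror reRankDocs, reRankScoreThreshold, reRankWeight, and reRankMode. The exact scoring formulas, and the choice to keep rescored documents ahead of the rest, are assumptions of this sketch rather than documented plug-in behavior.

```python
from urllib.parse import urlencode


def build_params(field, query_name, main_query):
    """URL-encode a request pairing a main query with the rerank query from slide 18."""
    return urlencode({
        "q": main_query,
        "rq": "{!rniRerank reRankQuery=$rrq}",
        "rrq": f'{{!func}} rniMatch({field}, "{query_name}")',
    })


def rerank(hits, name_score, docs=10, threshold=0.0, weight=1.0, mode="add"):
    """Rescore the leading hits with a name-matching score.

    hits: list of (doc_id, main_score), sorted by main_score descending.
    name_score: callable doc_id -> float, standing in for rniMatch.
    """
    if mode not in ("add", "replace"):
        raise ValueError("mode must be 'add' or 'replace'")
    rescored, untouched = [], []
    for position, (doc_id, score) in enumerate(hits):
        # Only the first `docs` hits whose main score beats the threshold
        # pay for the costly name comparison.
        if position < docs and score > threshold:
            weighted = weight * name_score(doc_id)
            # Assumed semantics: 'add' sums the scores, 'replace' discards the main score.
            new_score = score + weighted if mode == "add" else weighted
            rescored.append((doc_id, new_score))
        else:
            untouched.append((doc_id, score))
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    # Assumption: rescored documents stay ahead of documents that were not rescored.
    return rescored + untouched


if __name__ == "__main__":
    print(build_params("name", "Bob Smitty", 'name:"Bob Smitty"'))
    hits = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
    scores = {"a": 0.1, "b": 0.9, "c": 0.5}
    print(rerank(hits, scores.get, docs=2, mode="replace"))
```

Lowering `docs` or raising `threshold` shrinks the set of documents that reach `name_score`, which is the speed side of the tradeoff on slide 21; the cost is that a good name match ranked low by the main query may never be rescored.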