Simple Fuzzy Name Matching in Solr

April 23, 2015
David Murgatroyd @dmurga
VP Engineering

Quick survey: How many of us...
● Regularly develop Solr applications?
● Develop Solr applications that include names of…
○ ...People?
○ ...Places?
○ ...Products?
○ ...Organizations?
○ …(other entity types)?
● Have names in languages besides English?
● Want to have better name search?

Motivating Questions...
● How could a border officer know whether you’re on a terrorist watch list?
● How does your bank know if you’re wiring money to a drug lord?
● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?

Answer...
Name Matching (plus more)

Best Practice: field per variation type?

But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobEzDiaS, Chuy”
● Reordered.
● Missing initial.
● Two spelling differences.
● Nickname for first name.
● Missing space.

Can’t a name field type do this? Like…
● Contribute score that reflects phenomena.
● Be part of queries using many field types.
● Have multiple fields per document.
● Have multiple values per field.

How could you use such a Field?
● Plugin contains custom field type which does all the work behind the scenes
● Simple addition to schema.xml to include new fieldType
<fieldType name="rni_name"
    class="com.basistech.rni.solr.NameField"/>
<field name="name" type="rni_name" indexed="true"
    stored="true" multiValued="false"/>
<field name="aka" type="rni_name" indexed="true"
    stored="true" multiValued="true"/>

What happens at index time?
● NameField indexes keys for different phenomena in separate (sub)fields
List<IndexableField> createFields(SchemaField field, String name) {
    // One FieldSpec per derived key, each written to its own subfield.
    Collection<FieldSpec> nameFields = deriveFieldsForName(name);
    List<IndexableField> docFields = new ArrayList<>();
    for (FieldSpec fs : nameFields) {
        docFields.add(new Field(fs.getName(), fs.getStringValue(),
                                fs.getLuceneField()));
    }
    // Also keep the full name as DocValues so it can be rescored at query time.
    docFields.add(createDocValues(field.getName(), new Name(name)));
    return docFields;
}
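
To make the idea of per-phenomenon keys concrete, here is a minimal sketch in plain Java with no Solr dependency. Everything in it is a hypothetical illustration: the class `NameKeySketch`, the subfield suffixes (`_token`, `_initial`, `_skeleton`) and the three key functions are assumptions, since the deck does not show what `deriveFieldsForName` actually computes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: derive one set of keys per variation phenomenon
// and file each set under its own subfield of the user's name field.
class NameKeySketch {

    // Split on anything that is not a letter, so "Smith, Robert" and
    // "Robert Smith" yield the same tokens (in a different order).
    static List<String> tokens(String name) {
        List<String> out = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Crude spelling-tolerant key: keep the first letter, drop later vowels.
    static String skeleton(String token) {
        return token.charAt(0) + token.substring(1).replaceAll("[aeiouy]", "");
    }

    // Returns subfield name -> keys. Sets ignore token order, which is
    // what makes reordered names produce identical keys.
    static Map<String, Set<String>> deriveKeys(String field, String name) {
        Set<String> tokenKeys = new TreeSet<>();
        Set<String> initialKeys = new TreeSet<>();
        Set<String> skeletonKeys = new TreeSet<>();
        for (String t : tokens(name)) {
            tokenKeys.add(t);                   // exact token
            initialKeys.add(t.substring(0, 1)); // covers "R. Smith"
            skeletonKeys.add(skeleton(t));      // covers "Smith" vs "Smyth"
        }
        Map<String, Set<String>> subfields = new LinkedHashMap<>();
        subfields.put(field + "_token", tokenKeys);
        subfields.put(field + "_initial", initialKeys);
        subfields.put(field + "_skeleton", skeletonKeys);
        return subfields;
    }
}
```

Calling `NameKeySketch.deriveKeys("name", "Robert Smith")` gives three subfields whose contents do not depend on word order, so a reordered name maps to the same keys.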

Indexing
[Diagram: User Doc (name:"Robert Smith", dob:2/13/1987) → Plug-in Implementation → Index (name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987)]

What happens at query time?
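● Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking
public Query getFieldQuery(QParser parser, SchemaField field, String val) {
    // Parse the raw query value into a Name, honoring request params.
    Name name = parseNameString(val, parser.getParams());
    // Build a query over the same kind of keys written at index time.
    QuerySpec querySpec = buildQuery(name);
    // Translate that query spec into a Lucene Query on this field.
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}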
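
The candidate-finding step can be sketched without Lucene as a lookup in a tiny in-memory inverted index. This is an assumption-laden illustration, not the plug-in's query: `CandidateQuerySketch`, its two key types (`token:` and `initial:`) and the OR semantics are all hypothetical stand-ins for whatever `buildQuery` produces.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of high-recall candidate retrieval: a document is a
// candidate if it shares at least one key with the query name.
class CandidateQuerySketch {

    // key -> ids of documents whose name produced that key
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // The same key function is used at index time and at query time.
    static Set<String> keys(String name) {
        Set<String> keys = new TreeSet<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (t.isEmpty()) {
                continue;
            }
            keys.add("token:" + t);
            keys.add("initial:" + t.charAt(0));
        }
        return keys;
    }

    void index(int docId, String name) {
        for (String k : keys(name)) {
            postings.computeIfAbsent(k, x -> new TreeSet<>()).add(docId);
        }
    }

    // Union of the posting lists, i.e. a boolean OR over all query keys.
    Set<Integer> candidates(String queryName) {
        Set<Integer> hits = new TreeSet<>();
        for (String k : keys(queryName)) {
            hits.addAll(postings.getOrDefault(k, Collections.<Integer>emptySet()));
        }
        return hits;
    }
}
```

With "Robert Smith" indexed as document 1 and "Alice Jones" as document 2, the query "Bob Smitty" retrieves only document 1, through the shared `initial:s` key. A loose match like this is deliberately permissive; the precise scoring is left to the rerank step.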

What else happens at query time?
● Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
○ &rq={!rniRerank reRankQuery=$rrq reRankWeight=1 reRankMode=replace}
  &rrq={!func}rniMatch(name, "LobEzDiaS, Chuy A.")
○ Tuned for high precision
○ Simple addition to solrconfig.xml
<queryParser name="rniRerank"
    class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<valueSourceParser name="rniMatch"
    class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
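
As a rough sketch of how a client might assemble these request parameters, the following plain-Java helper builds `q`, `rq` and `rrq` and URL-encodes them. The class `RerankRequestSketch` and its methods are hypothetical, and the backslash escaping of quotes inside the name literal is an assumption about what the parsers accept rather than something the deck states.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles the rerank request parameters.
class RerankRequestSketch {

    static Map<String, String> params(String field, String queryName,
                                      int reRankWeight, String reRankMode) {
        // Assumption: escaping backslashes and double quotes keeps the
        // name a single quoted literal.
        String literal = queryName.replace("\\", "\\\\").replace("\"", "\\\"");
        Map<String, String> p = new LinkedHashMap<>();
        // Main query on the name field (finds candidates).
        p.put("q", field + ":\"" + literal + "\"");
        // Rerank spec; $rrq dereferences the rrq parameter below.
        p.put("rq", "{!rniRerank reRankQuery=$rrq reRankWeight=" + reRankWeight
                + " reRankMode=" + reRankMode + "}");
        // Function query that rescores the candidates.
        p.put("rrq", "{!func}rniMatch(" + field + ", \"" + literal + "\")");
        return p;
    }

    // Join the parameters into an application/x-www-form-urlencoded string.
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
        return sb.toString();
    }
}
```

Keeping the name in one place (`literal`) and reusing it for both `q` and `rrq` avoids the two queries drifting apart.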

How does that work?
● The NameMatchValueSourceParser parses the 'rniMatch' rerank query and returns a function that scores the query name against the indexed names
public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    // rniMatch(field, "query name"): two arguments, in that order.
    List<ValueSource> sources = fp.parseValueSourceList();
    ValueSource indexNameFieldSrc = sources.get(0);
    ValueSource queryNameSrc = sources.get(1);
    // The second argument is a quoted literal holding the query name.
    String queryStr = ((LiteralValueSource) queryNameSrc).getValue();
    Name qName = NameField.parseNameString(queryStr, fp.getParams());
    return new NameMatchFunction(indexNameFieldSrc, qName);
}

What does that function do?
● The NameMatchFunction returns the highest scoring match in every document that gets reranked
public double doubleVal(int doc) {
    // Get the names indexed as DocValues in this document
    BytesRef br = new BytesRef();
    indexNameValues.bytesVal(doc, br);
    // Deserialize them into Name objects
    Name[] indexedNames = NAME_SERIALIZER.bytesToNames(br.bytes);
    // Match each against the query name and return the highest score
    double maxScore = 0.0;
    for (Name indexName : indexedNames) {
        double score = cs.score(indexName);
        maxScore = Math.max(maxScore, score);
    }
    return maxScore;
}
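
The "take the best score across all names in the document" logic can be tried out in isolation. In the sketch below, a normalized edit-distance similarity stands in for the real name scorer, which the deck does not describe; `MaxNameScoreSketch`, `similarity` and `bestScore` are hypothetical names.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: score a query name against every name stored in a
// multivalued field and keep the maximum.
class MaxNameScoreSketch {

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Stand-in scorer: 1.0 for identical strings, approaching 0.0 as the
    // strings diverge. Case is ignored.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(x, y) / longest;
    }

    // Highest score over all indexed names; 0.0 when there are none.
    static double bestScore(String queryName, List<String> indexedNames) {
        double maxScore = 0.0;
        for (String indexed : indexedNames) {
            maxScore = Math.max(maxScore, similarity(queryName, indexed));
        }
        return maxScore;
    }
}
```

Taking the maximum means a document with several aliases is ranked by its closest alias, so adding more values to a multivalued field can raise a document's score but never lower it.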

[Diagram: three stages.
Indexing: User Doc (name:"Robert Smith", dob:2/13/1987) → Plug-in Implementation → Index (name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987)
Main Query: User Query (q=name:"Bob Smitty") → booleanQuery (name_Key1:..., name_Key2:..., name_Key3:...)
Rerank Query: Reranker (rniMatch(name, "Bob Smitty")) → name:"Robert Smith", dob:2/13/1987, score: .79]

Trading Off Accuracy for Speed
[Diagram: Query → High Recall Query (Solr) → High Recall Results → Subset (Score > reRankScoreThreshold & Total < reRankDocs) → ReRank Rescoring (for High Precision) → Scored Results]

Rerank Params - Speed v. Accuracy
● reRankScoreThreshold - Added by Us
○ Score threshold a top document must meet to be rescored
○ Tradeoff accuracy vs speed
● reRankDocs
○ Controls how many of the top documents to rescore
○ Tradeoff accuracy vs speed
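
How the two parameters interact can be sketched as a simple filter over the main query's results. This is an illustrative reading of the slide ("Score > reRankScoreThreshold & Total < reRankDocs"), not the plug-in's code; `RerankSubsetSketch` and `selectForRescoring` are hypothetical names, and the input is assumed to be sorted by descending main-query score.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: choose which of the main query's hits get the
// expensive rescoring pass.
class RerankSubsetSketch {

    // sortedScores: main-query scores in descending order.
    // Returns the positions of the hits to rescore.
    static List<Integer> selectForRescoring(double[] sortedScores,
                                            int reRankDocs,
                                            double reRankScoreThreshold) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < sortedScores.length && i < reRankDocs; i++) {
            if (sortedScores[i] <= reRankScoreThreshold) {
                // Scores only get lower from here, so stop early.
                break;
            }
            selected.add(i);
        }
        return selected;
    }
}
```

For scores `{0.9, 0.7, 0.4, 0.2}` with `reRankDocs = 3` and a threshold of `0.5`, only positions 0 and 1 are rescored: the third hit is within the document limit but falls below the threshold. Raising either limit rescoring more documents costs time; lowering them risks leaving a good match unrescored.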

Rerank Params - Integration w/Query
● reRankQuery
○ Calls the NameMatch function to get score
○ Can query multiple names or other fields
● reRankWeight
○ Controls how much weight is given to name score vs main query
○ Allows user to include queries on other non-name fields
● reRankMode - Added by Us
○ Controls how the rerank score should be combined with main query score
○ Currently 'add' or 'replace'
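
A small sketch of how the two modes might combine scores. The 'add' branch follows stock Solr reranking, where the rerank score is multiplied by reRankWeight and added to the main query score. The 'replace' branch is an assumption: the deck only names the mode, so discarding the main score and keeping the weighted rerank score is a guess at its meaning. `RerankModeSketch` is a hypothetical name.

```java
// Hypothetical sketch of combining a main-query score with a rerank score.
class RerankModeSketch {

    enum Mode { ADD, REPLACE }

    static double combine(double mainScore, double reRankScore,
                          double reRankWeight, Mode mode) {
        switch (mode) {
            case ADD:
                // Stock Solr behavior: main score plus weighted rerank score.
                return mainScore + reRankWeight * reRankScore;
            case REPLACE:
                // Assumed meaning: the main score is discarded entirely.
                return reRankWeight * reRankScore;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

Under this reading, 'replace' makes the final score directly comparable across queries because it depends only on the name match, while 'add' lets matches on other, non-name fields still influence the ranking.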

Summary: How it works
● Custom field type
○ Splits a single field into multiple fields covering different phenomena
○ Supports multiple name fields in a document as well as multivalued fields
○ Intercepts the query to inject a custom Lucene query
● Custom rerank function
○ Rescores documents with algorithm specific to name matching
○ Limits intense calculations to only top candidates
○ Highly configurable

Suggested Questions:
● Thank David Smiley for helping? (Yes!)
● What if the names are in other text fields?
● What about support in Solr 5.*?
● How did you implement multi-valued fields?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-Centric Search?
● How do the plug-in’s scores relate to Solr scores?
