Simple Fuzzy Name
Matching in Solr
April 23, 2015
David Murgatroyd @dmurga
VP Engineering
Quick survey: How many of us...
● Regularly develop Solr applications?
● Develop Solr applications that include names
of…
○ ...People?
○ ...Places?
○ ...Products?
○ ...Organizations?
○ …(other entity types)?
● Have names in languages besides English?
● Want to have better name search?
Motivating Questions...
● How could a border officer know whether
you’re on a terrorist watch list?
● How does your bank know if you’re wiring
money to a drug lord?
● How can an ecommerce site treat “Ho-medics
Ultra sonic” and “Homedics Ultrasconic” as
the same thing?
Answer...
Name Matching (plus more)
What kinds of name variation?
Best Practice: field per variation type?
But what if variations co-occur?
“Jesus Alfonso Lopez Diaz”
v.
“LobEzDiaS, Chuy”
● Reordered.
● Missing initial.
● Two spelling differences.
● Nickname for first name.
● Missing space.
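To make the co-occurrence problem concrete, here is a minimal sketch (an illustration only, not the plug-in's matching algorithm). The class and method names are hypothetical. It shows that case folding and space removal alone still leave the surname two edits apart, so a check aimed at a single phenomenon cannot match the pair.

```java
import java.util.Locale;

// Hypothetical illustration: one normalization step does not resolve
// a name pair in which several variation types occur together.
final class CoOccurringVariationSketch {

    // Lowercase and drop everything that is not a letter (spaces, commas).
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]", "");
    }

    // Classic two-row Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String indexed = normalize("Lopez Diaz");
        String query = normalize("LobEzDiaS");
        // The missing space and the casing are handled, but the strings still differ.
        System.out.println("exact match after normalization: " + indexed.equals(query));
        System.out.println("remaining edit distance: " + levenshtein(indexed, query));
    }
}
```

The nickname ("Chuy" for "Jesus") and the reordering are untouched by this sketch, which is why the phenomena need to be handled together.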
Can’t a name field type do this? Like…
● Contribute score that reflects phenomena.
● Be part of queries using many field types.
● Have multiple fields per document.
● Have multiple values per field.
Demo
How could you use such a Field?
● Plugin contains custom field type which does
all the work behind the scenes
● Simple addition to schema.xml to include
new fieldType
<fieldType name="rni_name"
class="com.basistech.rni.solr.NameField"/>
<field name="name" type="rni_name" indexed="true"
stored="true" multiValued="false"/>
<field name="aka" type="rni_name" indexed="true"
stored="true" multiValued="true"/>
What happens at index time?
● NameField indexes keys for different
phenomena in separate (sub) fields
List<IndexableField> createFields(SchemaField field, String name) {
    // Derive one key field per name-variation phenomenon.
    Collection<FieldSpec> nameFields = deriveFieldsForName(name);
    List<IndexableField> docFields = new ArrayList<>();
    for (FieldSpec fs : nameFields) {
        docFields.add(new Field(fs.getName(), fs.getStringValue(),
                                fs.getLuceneField()));
    }
    // Store the full name as DocValues so it can be rescored at query time.
    docFields.add(createDocValues(field.getName(), new Name(name)));
    return docFields;
}
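The slides do not say which keys the plug-in derives (they appear only as name_Key1, name_Key2, name_Key3). As a rough sketch of the idea, the example below derives three invented key types: lowercased tokens, initials, and a crude vowel-stripped skeleton. The sub-field suffixes and the class name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: split one name into several sub-fields,
// each holding keys aimed at a different kind of variation.
final class NameKeySketch {

    static Map<String, List<String>> deriveSubFields(String fieldName, String name) {
        // Tokenize on anything that is not a letter.
        List<String> tokens = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }

        List<String> initials = new ArrayList<>();
        List<String> skeletons = new ArrayList<>();
        for (String t : tokens) {
            // Initials help when a name part is abbreviated.
            initials.add(t.substring(0, 1));
            // Keep the first letter, drop later vowels: tolerant of some misspellings.
            skeletons.add(t.charAt(0) + t.substring(1).replaceAll("[aeiou]", ""));
        }

        Map<String, List<String>> subFields = new LinkedHashMap<>();
        subFields.put(fieldName + "_token", tokens);
        subFields.put(fieldName + "_initial", initials);
        subFields.put(fieldName + "_skeleton", skeletons);
        return subFields;
    }

    public static void main(String[] args) {
        System.out.println(deriveSubFields("name", "Robert Smith"));
    }
}
```

Each map entry corresponds to one extra indexed field on the document, alongside the user's original field.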
Indexing
[Diagram: User Doc (name:"Robert Smith", dob:2/13/1987) → Plug-in Implementation → Index (name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987)]
What happens at query time?
● Step #1: NameField generates analogous
keys for a custom Lucene query that finds
good candidates for re-ranking
public Query getFieldQuery(QParser parser, SchemaField field, String val) {
    // Parse the query string into a Name, then build the candidate query from its keys.
    Name name = parseNameString(val, parser.getParams());
    QuerySpec querySpec = buildQuery(name);
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}
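A simplified view of the candidate query: keys derived from the query name are OR-ed together across the key sub-fields, which favors recall. The sketch below builds a query string instead of Lucene Query objects so that it stays dependency-free. The sub-field name and the class are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: turn derived keys into one high-recall OR query.
final class CandidateQuerySketch {

    static String buildCandidateQuery(Map<String, List<String>> keysBySubField) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, List<String>> entry : keysBySubField.entrySet()) {
            for (String key : entry.getValue()) {
                // Quote each key and escape backslashes and embedded quotes.
                String escaped = key.replace("\\", "\\\\").replace("\"", "\\\"");
                clauses.add(entry.getKey() + ":\"" + escaped + "\"");
            }
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> keys = new LinkedHashMap<>();
        keys.put("name_token", Arrays.asList("bob", "smitty"));
        System.out.println(buildCandidateQuery(keys));
    }
}
```

Any document sharing at least one key with the query becomes a candidate, and the rerank step then sorts out precision.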
What else happens at query time?
● Step #2: Uses Solr’s Rerank feature to rescore names
in top documents and reorder accordingly
○ &rq={!rniRerank reRankQuery=$rrq
reRankWeight=1 reRankMode=replace}
&rrq={!func}rniMatch(name, "LobEzDiaS, Chuy A.")
○ Tuned for high precision
○ Simple addition to solrconfig.xml
<queryParser name="rniRerank"
class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<valueSourceParser name="rniMatch"
class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
How does that work?
● The NameMatchValueSourceParser parses
the 'rniMatch' rerank query and returns a
function that scores the query name against
the indexed names
public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    List<ValueSource> sources = fp.parseValueSourceList();
    // First argument: the indexed name field; second: the literal query name.
    ValueSource indexNameFieldSrc = sources.get(0);
    ValueSource queryNameSrc = sources.get(1);
    String queryStr = ((LiteralValueSource) queryNameSrc).getValue();
    Name qName = NameField.parseNameString(queryStr, fp.getParams());
    return new NameMatchFunction(indexNameFieldSrc, qName);
}
● The NameMatchFunction returns the highest
scoring match in every document that gets
reranked
public double doubleVal(int doc) {
    // Get the names indexed as DocValues in this document
    BytesRef br = new BytesRef();
    indexNameValues.bytesVal(doc, br);
    // Deserialize them into Name objects
    Name[] indexedNames = NAME_SERIALIZER.bytesToNames(br.bytes);
    // Match each against the query name and return the highest score
    double maxScore = 0.0;
    for (Name indexName : indexedNames) {
        double score = cs.score(indexName);
        maxScore = Math.max(maxScore, score);
    }
    return maxScore;
}
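The "highest score wins" rule for multi-valued fields can be shown without Lucene. In this sketch the pairwise scorer is a toy token-overlap measure chosen only for illustration. The plug-in's real scorer is not described in the slides, and all names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch: score a query name against every value of a
// multi-valued name field and keep the best score for the document.
final class MaxNameScoreSketch {

    static double bestScore(String queryName, List<String> indexedNames,
                            ToDoubleBiFunction<String, String> scorer) {
        double max = 0.0;
        for (String indexed : indexedNames) {
            max = Math.max(max, scorer.applyAsDouble(queryName, indexed));
        }
        return max;
    }

    // Toy scorer (an assumption): fraction of query tokens found in the indexed name.
    static double tokenOverlap(String queryName, String indexedName) {
        String[] queryTokens = queryName.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> indexedTokens = new HashSet<>(
                Arrays.asList(indexedName.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        int hits = 0;
        for (String token : queryTokens) {
            if (indexedTokens.contains(token)) {
                hits++;
            }
        }
        return (double) hits / queryTokens.length;
    }

    public static void main(String[] args) {
        List<String> aka = Arrays.asList("Robert Smith", "Bob Smith Jr.");
        System.out.println(bestScore("Bob Smith", aka, MaxNameScoreSketch::tokenOverlap));
    }
}
```

Taking the maximum means one strong alias is enough for a document to rank well, regardless of how many weaker aliases it carries.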
What does that function do?
[Diagram, three stages:
Indexing: User Doc (name:"Robert Smith", dob:2/13/1987) → Plug-in Implementation → Index (name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987)
Main Query: User Query q=name:"Bob Smitty" → booleanQuery (name_Key1:..., name_Key2:..., name_Key3:...)
Rerank Query: Reranker rniMatch(name, "Bob Smitty") → name:"Robert Smith", dob:2/13/1987, score: .79]
[Diagram: Query → High Recall Query (Solr) → Subset of High Recall Results (Score > reRankScoreThreshold & Total < reRankDocs) → ReRank Rescoring (for High Precision) → Scored Results]
Trading Off Accuracy for Speed
● reRankScoreThreshold - Added by Us
○ Score threshold top doc must meet to be rescored
○ Tradeoff accuracy vs speed
● reRankDocs
○ Controls how many of the top documents to rescore
○ Tradeoff accuracy vs speed
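How the two limits might interact can be sketched as below. This is an assumption based on the pipeline diagram (score above reRankScoreThreshold, count within reRankDocs), not the plug-in's actual code, and whether the comparison is strict or inclusive is a guess.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pick which main-query hits get the expensive rescoring.
final class RerankSelectionSketch {

    // mainScoresDescending: main-query scores, best first.
    // Returns the ranks (0-based) of the documents to rescore.
    static List<Integer> selectForRescoring(double[] mainScoresDescending,
                                            int reRankDocs,
                                            double reRankScoreThreshold) {
        List<Integer> selected = new ArrayList<>();
        for (int rank = 0;
             rank < mainScoresDescending.length && selected.size() < reRankDocs;
             rank++) {
            // Scores are sorted, so once one falls to or below the threshold the rest do too.
            // Assumption: strict ">" as drawn in the diagram.
            if (mainScoresDescending[rank] <= reRankScoreThreshold) {
                break;
            }
            selected.add(rank);
        }
        return selected;
    }

    public static void main(String[] args) {
        double[] scores = {5.0, 3.2, 1.1, 0.4};
        System.out.println(selectForRescoring(scores, 3, 1.0));
        System.out.println(selectForRescoring(scores, 3, 2.0));
    }
}
```

Raising the threshold or lowering reRankDocs shrinks the rescored set, which buys speed at the risk of skipping a true match that scored poorly in the main query.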
Rerank Params - Speed v. Accuracy
Rerank Params - Integration w/Query
● reRankQuery
○ Calls the NameMatch function to get score
○ Can query multiple names or other fields
● reRankWeight
○ Controls how much weight is given to name score vs
main query
○ Allows user to include queries on other non-name
fields
● reRankMode - Added by Us
○ Controls how the rerank score should be combined
with main query score
○ Currently 'add' or 'replace'
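A sketch of how reRankWeight and reRankMode could combine the two scores. The 'add' branch is assumed to follow the usual additive rerank combination (main score plus weighted rerank score). Whether 'replace' also applies the weight is an assumption, since the slides do not say.

```java
// Hypothetical sketch of combining the main-query score with the name-match score.
final class RerankCombineSketch {

    enum Mode { ADD, REPLACE }

    static double combine(double mainScore, double rerankScore,
                          double reRankWeight, Mode mode) {
        // ADD keeps the main-query score, so non-name clauses still influence ranking.
        // REPLACE ranks purely by the (weighted) name-match score.
        return mode == Mode.ADD
                ? mainScore + reRankWeight * rerankScore
                : reRankWeight * rerankScore;
    }

    public static void main(String[] args) {
        System.out.println(combine(2.0, 0.79, 1.0, Mode.ADD));
        System.out.println(combine(2.0, 0.79, 1.0, Mode.REPLACE));
    }
}
```

With 'replace', the returned score can be read directly as a name-match score, which is convenient when applying a fixed match threshold downstream.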
Summary: How it works
● Custom field type
○ Splits a single field into multiple fields covering
different phenomena
○ Supports multiple name fields in a document as well
as multivalued fields
○ Intercepts the query to inject a custom Lucene query
● Custom rerank function
○ Rescores documents with algorithm specific to name
matching
○ Limits intense calculations to only top candidates
○ Highly configurable
Suggested Questions:
● Thank David Smiley for helping? (Yes!)
● What if the names are in other text fields?
● What about support in Solr 5.*?
● How did you implement multi-valued fields?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-
Centric Search?
● How do the plug-in’s scores relate to Solr scores?

LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 

Simple Fuzzy Name Matching in Solr

  • 1. Simple Fuzzy Name Matching in Solr April 23, 2015 David Murgatroyd @dmurga VP Engineering
  • 2. Quick survey: How many of us... ● Regularly develop Solr applications? ● Develop Solr applications that include names of... ○ ...People? ○ ...Places? ○ ...Products? ○ ...Organizations? ○ ...(other entity types)? ● Have names in languages besides English? ● Want to have better name search?
  • 3. Motivating Questions... ● How could a border officer know whether you’re on a terrorist watch list? ● How does your bank know if you’re wiring money to a drug lord? ● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?
  • 5. What kinds of name variation?
  • 6. Best Practice: field per variation type?
  • 7. But what if variations co-occur? “Jesus Alfonso Lopez Diaz” v. “LobEzDiaS, Chuy” ● Reordered. ● Missing initial. ● Two spelling differences ● Nickname for first name. ● Missing space.
  • 8. Can’t a name field type do this? Like… ● Contribute score that reflects phenomena. ● Be part of queries using many field types. ● Have multiple fields per document. ● Have multiple values per field.
  • 10. How could you use such a Field?
    ● Plugin contains custom field type which does all the work behind the scenes
    ● Simple addition to schema.xml to include new fieldType
      <fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>
      <field name="name" type="rni_name" indexed="true" stored="true" multiValued="false"/>
      <field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>
  • 11. What happens at index time?
    ● NameField indexes keys for different phenomena in separate (sub) fields
      List<IndexableField> createFields(SchemaField field, String name) {
          Collection<FieldSpec> nameFields = deriveFieldsForName(name);
          List<IndexableField> docFields = new ArrayList<>();
          for (FieldSpec fs : nameFields) {
              docFields.add(new Field(fs.getName(), fs.getStringValue(), fs.getLuceneField()));
          }
          docFields.add(createDocValues(field.getName(), new Name(name)));
          return docFields;
      }
  • 13. What happens at query time?
    ● Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking
      public Query getFieldQuery(QParser parser, SchemaField field, String val) {
          // Parse the raw query string into a Name, honoring request parameters
          Name name = parseNameString(val, parser.getParams());
          QuerySpec querySpec = buildQuery(name);
          return querySpec.accept(new SolrQueryVisitor(field.getName()));
      }
  • 14. What else happens at query time?
    ● Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
      ○ &rq={!rniRerank reRankQuery=$rrq reRankWeight=1 reRankMode=replace}
        &rrq={!func}rniMatch(name, "LobEzDiaS, Chuy A.")
      ○ Tuned for high precision
      ○ Simple addition to solrconfig.xml
        <queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
        <valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
  • 15. How does that work?
    ● The NameMatchValueSourceParser parses the 'rniMatch' rerank query and returns a function that scores the query name against the indexed names
      public ValueSource parse(FunctionQParser fp) throws SyntaxError {
          // First argument: the indexed name field; second: the literal query name
          List<ValueSource> sources = fp.parseValueSourceList();
          ValueSource indexNameFieldSrc = sources.get(0);
          ValueSource queryNameSrc = sources.get(1);
          String queryStr = ((LiteralValueSource) queryNameSrc).getValue();
          Name qName = NameField.parseNameString(queryStr, fp.getParams());
          return new NameMatchFunction(indexNameFieldSrc, qName);
      }
  • 16. What does that function do?
    ● The NameMatchFunction returns the highest scoring match in every document that gets reranked
      public double doubleVal(int doc) {
          // Get the names indexed as DocValues in this document
          BytesRef br = new BytesRef();
          indexNameValues.bytesVal(doc, br);
          // Deserialize them into Name objects
          Name[] indexedNames = NAME_SERIALIZER.bytesToNames(br.bytes);
          // Match each against the query name and return the highest score
          double maxScore = 0.0;
          for (Name indexName : indexedNames) {
              double score = cs.score(indexName);
              maxScore = Math.max(maxScore, score);
          }
          return maxScore;
      }
  • 17. [Diagram: Plug-in Implementation, in three stages]
    ● Indexing
      ○ User Doc: name:"Robert Smith" dob:2/13/1987
      ○ Index: name:"Robert Smith" name_Key1:… name_Key2:… name_Key3:… dob:2/13/1987
    ● Main Query
      ○ User Query: q=name:"Bob Smitty"
      ○ booleanQuery: name_Key1:... name_Key2:... name_Key3:...
    ● Rerank Query
      ○ Reranker: rniMatch(name, "Bob Smitty")
      ○ Result: name:"Robert Smith" dob:2/13/1987 score : .79
  • 19. Rerank Params - Speed v. Accuracy
    ● reRankScoreThreshold - Added by Us
      ○ Score threshold top doc must meet to be rescored
      ○ Tradeoff accuracy vs speed
    ● reRankDocs
      ○ Controls how many of the top documents to rescore
      ○ Tradeoff accuracy vs speed
  • 20. Rerank Params - Integration w/Query
    ● reRankQuery
      ○ Calls the NameMatch function to get score
      ○ Can query multiple names or other fields
    ● reRankWeight
      ○ Controls how much weight is given to name score vs main query
      ○ Allows user to include queries on other non-name fields
    ● reRankMode - Added by Us
      ○ Controls how the rerank score should be combined with main query score
      ○ Currently 'add' or 'replace'
  • 21. Summary: How it works ● Custom field type ○ Splits a single field into multiple fields covering different phenomena ○ Supports multiple name fields in a document as well as multivalued fields ○ Intercepts the query to inject a custom Lucene query ● Custom rerank function ○ Rescores documents with algorithm specific to name matching ○ Limits intense calculations to only top candidates ○ Highly configurable
  • 22. Suggested Questions: ● Thank David Smiley for helping? (Yes!) ● What if the names are in other text fields? ● What about support in Solr 5.*? ● How did you implement multi-valued fields? ● How does it scale? ● How do you handle names not in English? ● How does this relate to the theme of Entity-Centric Search? ● How do the plug-in’s scores relate to Solr scores?
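
Slides 11 and 13 describe indexing keys for different phenomena in separate sub-fields, then generating analogous keys at query time to find candidates. The sketch below illustrates that idea in plain Java with no Solr or Lucene dependency. Everything in it is an assumption made for illustration: the class `NameKeySketch`, the sub-field names (`name_token`, `name_initial`, `name_skeleton`), and the crude consonant-skeleton key are hypothetical and are not the plug-in's actual key derivation.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NameKeySketch {

    // Hypothetical key derivation: one sub-field per kind of variation.
    public static Map<String, Set<String>> deriveKeys(String name) {
        Set<String> tokens = new TreeSet<>();
        Set<String> initials = new TreeSet<>();
        Set<String> skeletons = new TreeSet<>();
        // Split on anything that is not a letter (spaces, commas, periods, ...).
        for (String token : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            tokens.add(token);                     // exact token: survives reordering
            initials.add(token.substring(0, 1));   // initial: matches abbreviated names
            skeletons.add(skeleton(token));        // skeleton: tolerates vowel changes
        }
        Map<String, Set<String>> keys = new LinkedHashMap<>();
        keys.put("name_token", tokens);
        keys.put("name_initial", initials);
        keys.put("name_skeleton", skeletons);
        return keys;
    }

    // Crude stand-in for a phonetic key: keep the first letter, drop later
    // vowels (including 'y'), and collapse adjacent repeated letters.
    public static String skeleton(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean vowel = "aeiouy".indexOf(c) >= 0;
            if (i > 0 && vowel) {
                continue;
            }
            if (sb.length() > 0 && sb.charAt(sb.length() - 1) == c) {
                continue;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Mimics a boolean OR query over the sub-fields: a document is a
    // candidate if it shares at least one key with the query in any sub-field.
    public static boolean isCandidate(Map<String, Set<String>> indexed,
                                      Map<String, Set<String>> query) {
        for (Map.Entry<String, Set<String>> e : query.entrySet()) {
            Set<String> indexedKeys =
                    indexed.getOrDefault(e.getKey(), Collections.<String>emptySet());
            if (!Collections.disjoint(indexedKeys, e.getValue())) {
                return true;
            }
        }
        return false;
    }
}
```

The candidate test is deliberately loose (a shared initial is enough here), which matches the deck's division of labor: the first stage only needs to surface plausible candidates, and precision is left to the rerank step.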

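Slides 16, 19 and 20 describe the rerank step: take the highest-scoring name in each document, rescore only the top documents, and combine the name score with the main query score. The following is a minimal sketch of that flow, again in plain Java without Solr. Several things are assumptions rather than the plug-in's behavior: normalized edit distance stands in for the real name-matching score, the threshold is read as a minimum main-query score a top document must meet to be rescored, 'replace' is taken to mean the name score alone, and only the top `reRankDocs` entries are reordered. The class and member names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class RerankSketch {

    public static final class Doc {
        public final List<String> names;   // multi-valued name field
        public final double mainScore;     // score from the main query
        public double finalScore;          // score after reranking

        public Doc(double mainScore, String... names) {
            this.mainScore = mainScore;
            this.names = Arrays.asList(names);
        }
    }

    // Plain Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Stand-in similarity in [0, 1]; NOT the plug-in's matching algorithm.
    public static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(x, y) / longest;
    }

    // Highest score over all names stored in the document (as on slide 16).
    public static double bestNameScore(Doc doc, String queryName) {
        double max = 0.0;
        for (String name : doc.names) {
            max = Math.max(max, similarity(name, queryName));
        }
        return max;
    }

    public static List<Doc> rerank(List<Doc> hits, String queryName, int reRankDocs,
                                   double threshold, double weight, String mode) {
        List<Doc> ordered = new ArrayList<>(hits);
        ordered.sort(Comparator.comparingDouble((Doc d) -> d.mainScore).reversed());
        int headSize = Math.min(reRankDocs, ordered.size());
        for (int i = 0; i < ordered.size(); i++) {
            Doc d = ordered.get(i);
            d.finalScore = d.mainScore;
            // Skip the expensive match outside the top N or below the threshold.
            if (i >= headSize || d.mainScore < threshold) {
                continue;
            }
            double nameScore = bestNameScore(d, queryName);
            d.finalScore = "replace".equals(mode)
                    ? nameScore                          // assumed meaning of 'replace'
                    : d.mainScore + weight * nameScore;  // 'add'
        }
        // Reorder only the rescored head; the tail keeps its main-query order.
        ordered.subList(0, headSize)
               .sort(Comparator.comparingDouble((Doc d) -> d.finalScore).reversed());
        return ordered;
    }
}
```

Raising `reRankDocs` or lowering `threshold` sends more documents through `bestNameScore`, which is the speed versus accuracy tradeoff slide 19 refers to.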
Editor's Notes

  1. “name” in quotes? try to remember that TSA handles airports, not CBP :)
  2. “name” in quotes? try to remember that TSA handles airports, not CBP :)
  3. &rq={!rniRerank reRankQuery=$rrq reRankWeight=1 reRankMode=replace} &rrq={!func}rniMatch(name, "Chuy Lopez A Deyas")
  4. breezed through this too fast to grok
  5. does the audience understand “recall” and “precision”?
  6. mention lots of control, but with reasonable defaults so user tweaks are optional