Fuzzy Entity Matching 
Ken Krugler | President, Scale Unlimited
whoami 
•Ken Krugler, Scale Unlimited - Nevada City, CA 
•Consulting on big data (workflows, search, etc.)
•Training for Hadoop, Cascading, Solr & Cassandra
The Problem
Should I Trust You? 
•When opening a bank account... 
•...what is the applicant's risk? 
•Key is matching person... 
•...to other account info
Matching people 
•I have some information you've provided 
•I need to match against ALL bank data 
•But banks won't exchange their customer info 
•So what can we do?
Early Warning Services 
•Owned by the top 5 US banks 
•Gets data from 800+ financial institutions 
•So they have details on most US bank accounts
Fuzzy Matching
What's a fuzzy match? 
•Match everything that's equivalent (≅)
•Match nothing that's different (≇)
Why is it hard? 
•Lots of gray areas in fuzzy matching (≟)
•Can't use exact key join 
•So no easy lookup using C* row key 
•Often computationally intensive
Matching People 
•I've got information on lots of people 
•I'm being asked about a specific person 
•How to quickly find all good matches? 
•Not doing batch matching
What's a Good Match? 
•Comparing field values between records 
•Are these two people the same? 
Field    Record A     Record B
Name     Bob Bogus    Robert Bogus
Address  220 3rd Ave  220 3rd Avenue
City     Seattle      Seattle
State    WA           WA
ZIP      98104-2608   98104
What about now? 
•Normalization becomes critical 
•How to focus on the important features? 
Field    Record A              Record B
Name     Bob Bogus             Robert H. Bogus
Address  Apt 102, 220 3rd Ave  3220 3rd Avenue South
City     Seattle               Seattle
State    Washington            WA
ZIP      98104 (only one value appears on the slide)
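A minimal sketch of the kind of normalization this example calls for. Everything here is an illustrative assumption rather than the talk's actual code: the function names, the tiny lookup tables, and the field names. A production system would need far larger tables and rules for things like apartment numbers.

```python
import re

# Hypothetical lookup tables, deliberately tiny.
STREET_ABBREVIATIONS = {"ave": "avenue", "apt": "apartment"}
STATE_CODES = {"washington": "WA", "california": "CA"}


def normalize_address(address: str) -> str:
    """Lowercase, drop punctuation, and expand known street abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


def normalize_state(state: str) -> str:
    """Map a full state name to its two-letter code; pass codes through."""
    cleaned = state.strip()
    return STATE_CODES.get(cleaned.lower(), cleaned.upper())


def normalize_zip(zip_code: str) -> str:
    """Keep only the five-digit ZIP, dropping any ZIP+4 suffix."""
    match = re.match(r"\d{5}", zip_code.strip())
    return match.group(0) if match else ""


def normalize_record(record: dict) -> dict:
    """Apply the per-field normalizers to one person record."""
    return {
        "name": " ".join(record.get("name", "").lower().split()),
        "address": normalize_address(record.get("address", "")),
        "city": record.get("city", "").strip().lower(),
        "state": normalize_state(record.get("state", "")),
        "zip": normalize_zip(record.get("zip", "")),
    }
```

After this step "220 3rd Ave" and "220 3rd Avenue" compare as equal, as do "Washington" and "WA". It does nothing about "Bob" versus "Robert" or "220" versus "3220", which is where the matching logic has to take over.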
How do you calc similarity? 
•Calculate degree of similarity for each field (0 to 1.0)
•Give each field a weight (these sum to 1.0) 
•Score is sum(fieldN sim * fieldN weight) 
•So score is 0 (nothing in common) to 1.0 (exact dup)
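The scoring rule above can be sketched in a few lines. The weights and the use of `difflib.SequenceMatcher` as the per-field similarity are assumptions for illustration; the talk does not say which field metrics or weights were used.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; they are chosen to sum to 1.0.
FIELD_WEIGHTS = {"name": 0.4, "address": 0.3, "city": 0.1, "state": 0.1, "zip": 0.1}


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0.0, 1.0]; SequenceMatcher is a stand-in metric."""
    if not a and not b:
        # Two missing values are treated as no evidence, not as a match.
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Score is sum(field similarity * field weight)."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )
```

Because the weights sum to 1.0 and each field similarity is at most 1.0, the score stays between 0 and 1.0 without any extra scaling.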
Does that scale? 
•For a given person being matched... 
•You need to compare to every other person 
•Works for a few thousand people 
•Doesn't scale for 100s of millions of people
Search to the Rescue
Search is (fast) similarity 
•Find N most similar docs to this doc (my query) 
•Each doc has multi-dimensional feature vector 
•Each feature (dimension) is a unique word 
•Feature weight is TF * IDF
Cosine Similarity 
•Each document has a term vector 
•E.g. three unique words x, y, z 
•Weight is TF*IDF of each word 
•Calc cosine of angle between 2 vectors 
•That is the similarity score
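A small sketch of the idea, using a textbook TF*IDF weighting with a smoothed IDF. This is not Lucene's actual scoring formula; the function names and the whitespace tokenizer are assumptions made to keep the example short.

```python
import math
from collections import Counter


def tf_idf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF*IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF so a term present in every doc keeps a nonzero weight.
            term: count * (1.0 + math.log(n_docs / doc_freq[term]))
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Documents that share no terms score 0, a document compared with itself scores 1, and rarer shared terms pull the score up more than common ones.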
Cosine sim ≢ match sim 
•Doesn't have same level of sophistication 
•So throw a bigger net to find candidates 
•e.g. get top N*X, assuming at most X matches 
•Then do match similarity calc on this (small) set
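The bigger-net idea can be sketched as a two-stage function. Both similarity functions are stand-ins chosen for brevity: token overlap plays the role of the search engine, and a weighted per-field comparison plays the role of the match logic. The names, weights, and threshold are hypothetical.

```python
from difflib import SequenceMatcher


def search_similarity(query: dict, record: dict) -> float:
    """Cheap stand-in for the search engine: Jaccard overlap of all tokens."""
    q_tokens = set(" ".join(query.values()).lower().split())
    r_tokens = set(" ".join(record.values()).lower().split())
    union = q_tokens | r_tokens
    return len(q_tokens & r_tokens) / len(union) if union else 0.0


def match_similarity(query: dict, record: dict, weights: dict) -> float:
    """More careful stand-in for the match step: weighted per-field similarity."""
    total = 0.0
    for field, weight in weights.items():
        a, b = query.get(field, ""), record.get(field, "")
        if a and b:  # a missing value contributes nothing
            total += weight * SequenceMatcher(None, a, b).ratio()
    return total


def find_matches(query, records, weights, candidates=10, threshold=0.8):
    """Step 1: keep the top candidates by search similarity.
    Step 2: run the match similarity on that small set only."""
    # A real index returns the top candidates without scanning every record;
    # the full sort here only keeps the sketch self-contained.
    ranked = sorted(records, key=lambda r: search_similarity(query, r), reverse=True)
    scored = [(match_similarity(query, r, weights), r) for r in ranked[:candidates]]
    kept = [(score, r) for score, r in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

The expensive comparison runs at most `candidates` times per request, regardless of how many records are stored.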
So two-step process
(Diagram: a boosted query goes to the Solr index to fetch candidates; the Match step then scores them.)
Query: name=“Bob Bogus”^3 and ssn=“222447777”^10 and dob=“19600723”^5
Record tables shown:
Name          SSN        DOB
Bob Bogus     222447777  19610603
Robert Bogus  193618919  19600723
Bob Smith     479385821  19600723
Sam Stealthy  222447777  19930523

Name          SSN        DOB
Bob Bogus     222447777  19600723
...           ...        ...
Match scores shown: 0.90, 0.50, 0.10, 0.85, ...
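A sketch of how the boosted query might be assembled as a Lucene-style query string. The helper name is hypothetical, and joining the clauses with OR is my assumption: it lets records that match only some of the fields come back as candidates, whereas the slide writes "and" informally.

```python
def build_boosted_query(fields: dict[str, tuple[str, int]]) -> str:
    """Build field:"value"^boost clauses joined by OR."""
    clauses = []
    for field, (value, boost) in fields.items():
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{boost}')
    return " OR ".join(clauses)


query = build_boosted_query({
    "name": ("Bob Bogus", 3),
    "ssn": ("222447777", 10),
    "dob": ("19600723", 5),
})
print(query)
```

The boosts carry the same intent as the field weights in the match step: a hit on SSN should count for more than a hit on name.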
How do you pick N? 
•Can be small, if match sim ≈ search sim 
•If N is too big, it's inefficient 
•If N is too small, you miss matches 
•Tune search to mimic match sim 
•Right tradeoff depends on use case
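One way to explore this tradeoff, assuming you have a labeled sample where the true matches for a query are known, is to measure how many of them survive the search step at each N. The function names and the labeled-sample setup are assumptions, not something described in the talk.

```python
def recall_at_n(ranked_ids: list[str], true_match_ids: set[str], n: int) -> float:
    """Fraction of known true matches that appear in the top n search results."""
    if not true_match_ids:
        return 1.0
    found = true_match_ids.intersection(ranked_ids[:n])
    return len(found) / len(true_match_ids)


def smallest_n_for_recall(ranked_ids, true_match_ids, target=1.0):
    """Smallest n whose recall reaches the target, or None if never reached."""
    for n in range(1, len(ranked_ids) + 1):
        if recall_at_n(ranked_ids, true_match_ids, n) >= target:
            return n
    return None
```

The closer the search ranking tracks the match similarity, the earlier the true matches appear and the smaller N can be.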
What is Solr? 
•Enterprise search system, built on top of Lucene
•Open source project at Apache Software Foundation 
•Scales to billions of documents 
•Highly configurable & customizable 
•Integrated with Cassandra in DSE
Solr Schema 
•Defines set of fields in a document 
•Direct one-to-one mapping with Cassandra columns 
•Fields can be defined with synonyms, etc., etc. 
<fields> 
<field name="key" type="string" indexed="true" stored="true" /> 
<field name="name" type="text" indexed="true" stored="true" /> 
</fields>
DSE Search with Solr
What is DSE with Solr? 
•DSE-specific enhancement to Cassandra 
•Keeps a Solr index in sync with a C* table 
•Indexes distributed to all nodes
(Diagram: three nodes, each running C* & Solr, each holding a C* Table and its S* Index)
Handy replication & failover 
•Implementation leverages C* replication 
•So you get load balancing, reliability, scalability 
•You can replicate from a regular C* DC to Solr DC
(Diagram: a Solr DC with three C* & Solr nodes, and a C* DC with three C* nodes)
Who builds the index? 
•In background
•Much slower than C* updates
•Uses existing secondary index hook
(Diagram labels, in order: Secondary Index Hook; Distribute to indexing queues; Logical Rows; Indexing Queue; Read C* storage row; Create one Solr doc per entry; Apply FieldInputTransformer; Update Solr. Settings shown: back_pressure_threshold_per_core, max_solr_concurrency_per_core)
How fast is it? 
•Writing 170M records ≈ 2.5 hours 
•8 node DSE 4.0 cluster, 8 1TB SSDs on each 
•This is indexing during writes 
•About 15% of index available when writes finish 
•Complete index takes another 12 hours
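A quick back-of-the-envelope calculation from these figures. It assumes the extra 12 hours starts when the writes finish, so the full index takes about 14.5 hours end to end, and it reports cluster-wide averages only.

```python
RECORDS = 170_000_000
WRITE_HOURS = 2.5
EXTRA_INDEX_HOURS = 12      # assumed to start when the writes finish
NODES = 8

# Average write throughput across the cluster and per node.
write_rate = RECORDS / (WRITE_HOURS * 3600)
write_rate_per_node = write_rate / NODES

# Average indexing throughput over the whole run.
index_hours = WRITE_HOURS + EXTRA_INDEX_HOURS
index_rate = RECORDS / (index_hours * 3600)

# Roughly how many records are searchable when the writes finish.
indexed_at_write_finish = 0.15 * RECORDS

print(f"write rate: {write_rate:,.0f} records/s ({write_rate_per_node:,.0f} per node)")
print(f"index rate: {index_rate:,.0f} docs/s over {index_hours} hours")
print(f"indexed when writes finish: about {indexed_at_write_finish:,.0f} records")
```

Under that assumption, writes average roughly 18,900 records per second while indexing averages roughly 3,300 documents per second, which is the gap behind "much slower than C* updates".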
System Overview
ETL Hadoop Workflow 
•Extract, transform, load 
•Built using Cascading API 
•Parse data, simple normalization 
•Other transformations happen in Solr
Cassandra ingress 
•Reduce tasks in Hadoop talk to C* cluster 
•Using DataStax Java driver for Cassandra 
•Bottleneck is Solr indexing 
•Inserts get throttled when this falls behind 
•But total time less than with deferred indexing
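The throttling behavior can be illustrated with a bounded queue between a fast writer and a slower indexer. This is a generic back-pressure sketch, not DSE's actual implementation; `run_ingest` and `max_pending` are hypothetical names.

```python
import queue
import threading


def run_ingest(records, max_pending=4):
    """Writer blocks on put() whenever the indexer falls max_pending behind."""
    pending = queue.Queue(maxsize=max_pending)   # bounded: this is the throttle
    indexed = []

    def indexer():
        while True:
            record = pending.get()
            if record is None:        # sentinel: no more records
                break
            indexed.append(record)    # stand-in for the slow index update

    worker = threading.Thread(target=indexer)
    worker.start()
    for record in records:
        pending.put(record)           # blocks while the queue is full
    pending.put(None)
    worker.join()
    return indexed
```

The writer never gets more than `max_pending` records ahead, so insert speed ends up governed by the indexing rate.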
Architectural Diagram 
(Diagram components: Hadoop Cluster, three C* + Solr nodes, Entity Matcher API)
Ingest performance 
•For max performance, write without reads 
•But how to avoid creating duplicate entries? 
•Set the row key to the hash of searchable fields 
•Accept "near duplicates" in search results 
•Possible to push some Solr load into workflow
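A sketch of deriving the row key from the searchable fields, so that writing the same person twice lands on the same row with no read first. The field list, the light normalization, and the choice of SHA-256 are assumptions; the talk only says the key is a hash of the searchable fields.

```python
import hashlib


def row_key(record: dict, searchable_fields: tuple = ("name", "ssn", "dob")) -> str:
    """Derive a deterministic row key from the normalized searchable fields."""
    parts = []
    for field in searchable_fields:
        # Lowercase and collapse whitespace so trivial variants hash the same.
        value = " ".join(str(record.get(field, "")).lower().split())
        parts.append(f"{field}={value}")
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Exact repeats collapse onto one row, while records that differ in any searchable field get different keys. Those are the "near duplicates" that still show up in search results.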
Summary
Key points to remember 
•This is for ad hoc requests, not batch deduplication 
•Use search to reduce candidate set, then match 
•Pain is in normalization, matching logic 
•DSE + Solr simplifies architecture & adds goodness
More questions? 
•Feel free to contact me 
•http://www.scaleunlimited.com/contact/ 
•Get training on DSE with Solr 
•http://www.datastax.com/what-we-offer/products-services/training

Cassandra Summit 2014: Fuzzy Entity Matching at Scale
