1
Copyright (c) 2014 Scale Unlimited.
Similarity at Scale
Fuzzy matching and recommendations
using Hadoop, Solr, and heuristics
Ken Krugler
Scale Unlimited
2
The Twitter Pitch
Wide class of problems that rely on "good" similarity
Fast
Accurate
Scalable
Benefit from my mistakes
Scale Unlimited - consulting & training
Talking about solutions to real problems
3
What are similarity problems?
Clustering
Grouping similar advertisers
Deduplication
Joining noisy sets of POI data
Recommendations
Suggesting pages to users
Entity resolution
Fuzzy matching of people and companies
4
What is "Similarity"?
Exact matching is easy(er)
Accuracy is a given
Fast and scalable can still be hard
Lots of key/value systems like Cassandra, HBase, etc.
Fuzzy matching is harder
Two "things" aren't exactly the same
Similarity is based on comparing features
5
Between two articles?
Features could be a bag of words
Are these two articles the same?
Bosnia is the largest geographic
region of the modern state with a
moderate continental climate,
marked by hot summers and cold,
snowy winters.
The inland is a geographically
larger region and has a moderate
continental climate, bookended by
hot summers and cold and snowy
winters.
6
What about now?
Easy to create challenging situations for a person
Which is an impossible problem for a computer
Need to distinguish between "conceptually similar" and "derived from"
Bosnia is the largest geographic
region of the modern state with a
moderate continental climate,
marked by hot summers and cold,
snowy winters.
Bosnia has a warm European
climate, though the summers can
be hot and the winters are often
cold and wet.
7
Between two records?
Features could be field values
Are these two people the same?
Name     Bob Bogus        Robert Bogus
Address  220 3rd Avenue   220 3rd Avenue
City     Seattle          Seattle
State    WA               WA
Zip      98104-2608       98104
8
What about now?
Need to get rid of false differences caused by abbreviations
How does a computer know what's a "significant" difference?
Name     Bob Bogus              Robert H. Bogus
Address  Apt 102, 3220 3rd Ave  220 3rd Avenue South
City     Seattle                Seattle
State    Washington             WA
Zip 98104
9
Between two users?
Features could be...
Items a user has bought
Are these two users the same?
[Figure: side-by-side images for User 1 and User 2]
10
What about now?
Need more generic features
E.g. product categories
[Figure: side-by-side images for User 1 and User 2]
11
How to measure similarity?
Assuming you have some features for two "things"
How does a program determine their degree of similarity?
You want a number that represents their "closeness"
Typically 1.0 means exactly the same
And 0.0 means completely different
12
Jaccard Coefficient
Ratio of number of items in common / total number of distinct items across both
Where "items" typically means unique values (sets of things)
So 1.0 is exactly the same, and 0.0 is completely different
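As a minimal sketch of the ratio above, treating each "thing" as a set of unique values. The function name and the sample word lists are illustrative only, and scoring two empty sets as 1.0 is an assumption.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, compared as sets of unique values."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty sets count as identical
    # items in common / total number of distinct items across both
    return len(set_a & set_b) / len(set_a | set_b)


# Illustrative bags of words: 5 words in common, 7 distinct words overall.
words_1 = "hot summers and cold snowy winters".split()
words_2 = "hot summers and cold wet winters".split()
score = jaccard(words_1, words_2)
print(round(score, 3))
```

Because only unique values count, repeating a word in either list does not change the score.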
13
Cosine Similarity
Assume a document only has three unique words
cat, dog, goldfish
Set x = frequency of cat
Set y = frequency of dog
Set z = frequency of goldfish
The result is a "term vector" with 3 dimensions
Calculate cosine of angle between term vectors
This is their "cosine similarity"
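A small sketch of that calculation, using word counts as the term vector. The function name and the two sample documents are made up for illustration; real search engines add term weighting on top of this.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two term-frequency vectors."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Dot product only needs the terms present in both documents.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# (x, y, z) = (cat, dog, goldfish) frequencies
doc_a = ["cat", "cat", "dog"]        # vector (2, 1, 0)
doc_b = ["cat", "dog", "goldfish"]   # vector (1, 1, 1)
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 3))
```

Unlike the Jaccard Coefficient, this measure takes term frequency into account, so repeating "cat" changes the result.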
14
Why is scalability hard?
Assume you have 8.5 million businesses in the US
There are ≈ N^2/2 pairs to evaluate
That's 36 trillion comparisons
Sometimes you can quickly trim this problem
E.g. if you assume the ZIP code exists, and must match
Then this becomes about 4 billion comparisons
But often you don't have a "magic" field
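The arithmetic, and the effect of a "magic" field, can be sketched as follows. The record layout and business names are hypothetical; the point is only that grouping by a key that must match limits comparisons to pairs inside each group.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Exact number of unordered pairs among n items: n*(n-1)/2, roughly n^2/2."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Yield only the pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


# 8.5 million businesses: about 3.6e13 pairs, i.e. roughly 36 trillion.
total = pair_count(8_500_000)

# Hypothetical sample records, blocked on ZIP code.
businesses = [
    {"name": "Bob's Diner", "zip": "98104"},
    {"name": "Bobs Diner Inc", "zip": "98104"},
    {"name": "Third Ave Cafe", "zip": "98104"},
    {"name": "Pike Street Books", "zip": "98101"},
]
candidates = list(blocked_pairs(businesses, key=lambda r: r["zip"]))
print(total, pair_count(len(businesses)), len(candidates))
```

In this toy set, blocking cuts 6 possible pairs down to 3. The cost is that records with a missing or wrong ZIP code are never compared.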
15
Copyright (c) 2011-2014 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
DataStax Web Site Page Recommender
16
How to recommend pages?
Besides manually adding a bunch of links...
Which is tedious, doesn't scale well, and gets busy
17
Can we exploit other users?
Classic shopping cart analysis
"Users who bought X also bought Y"
Based on actual activity, versus (noisy, skewed) ratings
18
What's the general approach?
We have web logs with IP addresses, time, path to page
157.55.33.39 - - [18/Mar/2014:00:01:00 -0500]
"GET /solutions/nosql HTTP/1.1"
A browsing session is a series of requests from one IP address
With some maximum time gap between requests
Find sessions "similar to" the current user's session
Recommend pages from these similar sessions
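A rough sketch of turning log lines into sessions. It assumes the log format shown above. The 30-minute gap is an assumed value (the slides only say "some maximum time gap"), and every sample path other than /solutions/nosql is invented.

```python
import re
from datetime import datetime, timedelta

# IP address, two ignored fields, [timestamp], then "METHOD path ..."
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)')


def parse_line(line):
    """Return (ip, timestamp, path), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, stamp, path = match.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return ip, when, path


def sessionize(lines, max_gap=timedelta(minutes=30)):
    """Group requests into sessions: same IP, no gap longer than max_gap."""
    requests = sorted(r for r in map(parse_line, lines) if r is not None)
    sessions = []
    last_ip, last_time = None, None
    for ip, when, path in requests:
        if ip != last_ip or when - last_time > max_gap:
            sessions.append([])  # start a new session
        sessions[-1].append(path)
        last_ip, last_time = ip, when
    return sessions


log_lines = [
    '157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:00:05:00 -0500] "GET /products HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:02:00:00 -0500] "GET /company HTTP/1.1"',
    '10.0.0.7 - - [18/Mar/2014:00:02:00 -0500] "GET /solutions/nosql HTTP/1.1"',
]
sessions = sessionize(log_lines)
print(sessions)
```

At scale this grouping would run as a Hadoop job rather than an in-memory sort, but the session rule is the same.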
19
How to find similar sessions?
Create a Lucene search index with one document per session
Each indexed document contains the page paths for one session
session-1 /path/to/page1, /path/to/page2, /path/to/page3
session-2 /path/to/pageX, /path/to/pageY
Search for paths from the current user's session
20
Why is this a search issue?
Solr (search in general) is all about similarity
Find documents similar to the words in my query
Cosine similarity is used to calculate similarity
Between the term vector for my query
and the term vector of each document
21
What's the algorithm?
Find sessions similar to the target (current user's) session
Calculate similarity between these sessions and the target session
Aggregate similarity scores for all paths from these sessions
Remove paths that are already in the target session
Recommend the highest scoring path(s)
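The steps above can be sketched in memory as follows. This is not the Lucene/Solr implementation: a Jaccard overlap stands in for the search engine's similarity score, and the session data and paths are hypothetical.

```python
from collections import defaultdict


def overlap(a, b):
    """Stand-in similarity (Jaccard); the real system uses the search score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target_paths, sessions, top_n=3):
    """Sum similarity scores per path over similar sessions, then rank."""
    target = set(target_paths)
    scores = defaultdict(float)
    for paths in sessions:
        session = set(paths)
        similarity = overlap(target, session)
        if similarity == 0.0:
            continue  # not a similar session
        # Skip paths already in the target session.
        for path in session - target:
            scores[path] += similarity
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


past_sessions = [
    ["/a", "/b", "/c"],   # shares two pages with the target
    ["/a", "/d"],         # shares one page
    ["/e", "/f"],         # shares nothing
]
ranked = recommend(["/a", "/b"], past_sessions)
print(ranked)
```

Here /c ranks above /d because it comes from the session that overlaps more with the target.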
22
Why do you sum similarities?
Give more weight to pages from sessions that are more similar
Pages from more similar sessions are assumed to be more interesting
23
What are some problems?
The classic problem is that we recommend "common" pages
E.g. if you haven't viewed the top-level page in your session
But this page is very common in most of the other sessions
So then it becomes one of the top recommended pages
But that generally stinks as a recommendation
24
Can RowSimilarityJob help?
Part of the Mahout open source project
Takes as input a table of users (one per row) with lists of items
Generates an item-item co-occurrence matrix
Values are weights calculated using log-likelihood ratio (LLR)
Unsurprising (common) items get low weights
If we run it on our data, where users = sessions and items = pages
We get page-page co-occurrence matrix
         Page 1  Page 2  Page 3
Page 1           2.1     0.8
Page 2   2.1             4.5
Page 3   0.8     4.5
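To show why common items get low weights, here is the log-likelihood ratio in its standard G-test form for a 2x2 table of session counts. This is a sketch of the statistic, not Mahout's code, and the counts in the example are invented.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-test) for a 2x2 co-occurrence table.

    k11: sessions with both pages      k12: page A without page B
    k21: page B without page A         k22: sessions with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    score = 0.0
    for observed, r, c in cells:
        if observed > 0:
            # Count expected if the two pages were independent.
            expected = rows[r] * cols[c] / total
            score += observed * math.log(observed / expected)
    return 2.0 * score


# A page seen in 900 of 1000 sessions co-occurs with page A exactly as
# often as chance predicts, so the weight is (near) zero.
common = llr(90, 10, 810, 90)
# Two pages that only ever appear together get a high weight.
surprising = llr(10, 0, 0, 10)
print(round(common, 6), round(surprising, 2))
```

Note that this statistic measures departure from independence in either direction, so a production implementation typically also checks that the pair co-occurs more often than expected.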
25
How to use co-occurrence?
Convert the matrix into an index
Each row is one Lucene document
Drop any low-scoring entries
Create list of "related" pages
Search in Related Pages field
Using pages from current session
So Page 2 recommends Page 1 & 3
         Page 1  Page 2  Page 3
Page 1           2.1     0.8
Page 2   2.1             4.5
Page 3   0.8     4.5

Page     Related Pages
Page 1   Page 2
Page 2   Page 1, Page 3
Page 3   Page 2
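A small in-memory sketch of this step, with a dictionary standing in for the Lucene index. The 1.0 cutoff is an assumption chosen so that the 0.8 entries are dropped, as in the example.

```python
cooccurrence = {
    "Page 1": {"Page 2": 2.1, "Page 3": 0.8},
    "Page 2": {"Page 1": 2.1, "Page 3": 4.5},
    "Page 3": {"Page 1": 0.8, "Page 2": 4.5},
}


def related_pages(matrix, min_score):
    """One 'document' per row: keep only entries scoring at least min_score."""
    return {
        page: sorted(other for other, score in row.items() if score >= min_score)
        for page, row in matrix.items()
    }


def recommend(session_pages, index):
    """Find documents whose related-pages field mentions a page in the session."""
    seen = set(session_pages)
    hits = {}
    for page, related in index.items():
        matches = len(seen.intersection(related))
        if matches and page not in seen:
            hits[page] = matches
    # More matching pages first, then by name for a stable order.
    return sorted(hits, key=lambda p: (-hits[p], p))


index = related_pages(cooccurrence, min_score=1.0)  # assumed cutoff
recommended = recommend(["Page 2"], index)
print(index)
print(recommended)
```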
26
EWS Entity Resolution
27
What is Early Warning?
Early Warning helps banks fight fraud
It's owned by the top 5 US banks
And gets data from 800+ financial institutions
So they have details on most US bank accounts
When somebody signs up for an account
They need to quickly match the person to "known entities"
And derive a risk score based on related account details
28
Why do they need similarity?
Assume you have information on 100s of millions of entities
Name(s), address(es), phone number(s), etc.
And often a unique ID (Social Security Number, EIN, etc.)
Why is this a similarity problem?
Data is noisy - typos, abbreviations, partial data
People lie - much fraud starts with opening an account using bad data
29
How does search help?
We can quickly build a list of candidate entities, using search
Query contains field data provided by the client bank
Significantly less than 1 second for 30 candidate entities
Then do more precise, sophisticated and CPU-intensive scoring
The end result is a ranked list of entities with similarity scores
Which then is used to look up account status, fraud cases, etc.
30
What's the data pipeline?
Incoming data is cleaned up/normalized in Hadoop
Simple things like space stripping
Also phone number formatting
ZIP+4 expansion into just ZIP plus full
Other normalization happens inside of Solr
This gets loaded into Cassandra tables
And automatically indexed by Solr, via DataStax Enterprise
ZIP+4        Terms
95014-2127   95014, 2127

Phone        Terms
4805551212   480, 5551212
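The two examples above could be produced by splitting rules along these lines. This is a plain-Python sketch rather than the Hadoop or Solr code. How malformed values are handled is an assumption, as is accepting punctuation and a leading "1" in phone numbers.

```python
import re


def zip_terms(value):
    """Split a ZIP or ZIP+4 into searchable terms."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 9:
        return [digits[:5], digits[5:]]  # 5-digit ZIP, then the +4 part
    if len(digits) == 5:
        return [digits]
    return []  # assumption: anything else is treated as unusable


def phone_terms(value):
    """Split a US phone number into area code and local number."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumption: drop a leading country code
    if len(digits) != 10:
        return []
    return [digits[:3], digits[3:]]


print(zip_terms("95014-2127"))
print(phone_terms("4805551212"))
```

Splitting into terms lets a partial match, such as the same 5-digit ZIP with a different +4, still contribute to the search score.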
31
What's the Solr setup?
Each field in the index has very specific analysis
Simple things like normalization
Synonym expansion for names, abbreviations
Split up fields so partial matches work
At query time we can weight the importance of each field
Which helps order the top N candidates similar to their real match scores
E.g. an SSN matching means much more than a first name matching
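One way to express query-time field weights is with per-clause boosts in Lucene query syntax. The sketch below only builds the request parameters. The field names and boost values are hypothetical, not the actual schema, and the 30-row limit simply mirrors the candidate count mentioned earlier.

```python
from urllib.parse import urlencode

# Hypothetical fields and weights: an SSN match counts far more than a name.
FIELD_WEIGHTS = {"ssn": 10.0, "last_name": 3.0, "phone": 2.0, "first_name": 1.0}


def build_candidate_query(values, weights=FIELD_WEIGHTS, rows=30):
    """Build Solr request parameters with one boosted phrase clause per field."""
    clauses = []
    for field, value in values.items():
        if not value:
            continue  # skip fields the client did not provide
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        boost = weights.get(field, 1.0)
        clauses.append(f'{field}:"{escaped}"^{boost}')
    return {"q": " OR ".join(clauses), "rows": rows, "fl": "*,score"}


params = build_candidate_query({"first_name": "bob", "ssn": "000-00-0000"})
query_string = urlencode(params)  # ready to append to a select URL
print(params["q"])
```

Using OR between clauses keeps partially matching entities in the candidate list, leaving the final decision to the more precise scoring step.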
32
Batch Similarity
33
Can we do batch similarity?
Search works well for real-time similarity
But batch processing at scale maxes out the search system
We can use two different techniques with Hadoop for batch
SimHash - good for text document similarity
Parallel Set-Similarity Joins - good for record similarity
34
What is SimHash?
Assume a document is a set of (unique) words
Calculate a hash for each word
Probability that the minimum hash is the same for two documents...
...is magically equal to the Jaccard Coefficient
Term         Hash
bosnia       78954874223
is           53466156768
the          5064199193
largest      3193621783
geographic   -5718349925
35
What is a SimHash workflow?
Calculate N hash values
Easy way is to use the N smallest hash values
Calculate number of matching hash values between doc pairs (M)
Then the Jaccard Coefficient is ≈ M/N
Only works if N is much smaller than # of unique words in docs
Implementation of this in cascading.utils open source project
https://github.com/ScaleUnlimited/cascading.utils
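The workflow above can be sketched in a few lines. This is not the cascading.utils implementation. The minimum-hash property it relies on is commonly called MinHash in the literature, and the choice of MD5 as the word hash and N = 4 are assumptions made for illustration.

```python
import hashlib


def term_hash(term):
    """Deterministic 64-bit hash of a word (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(words, n):
    """The N smallest hash values over the document's unique words."""
    return set(sorted({term_hash(w) for w in words})[:n])


def estimate_jaccard(sig_a, sig_b, n):
    """M matching hash values out of N gives roughly the Jaccard Coefficient."""
    return len(sig_a & sig_b) / n


N = 4
doc_a = "bosnia is the largest geographic region of the modern state".split()
doc_b = "the inland is a geographically larger region".split()
doc_c = "hot summers and cold snowy winters".split()

sig_a, sig_b, sig_c = (signature(d, N) for d in (doc_a, doc_b, doc_c))
print(estimate_jaccard(sig_a, sig_b, N))
print(estimate_jaccard(sig_a, sig_c, N))
```

Only the N-value signatures need to be compared, so documents can be matched without passing around their full text. As noted above, the estimate is only reasonable when N is much smaller than the number of unique words.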
36
What is Set-Similarity Join?
Joining records in two sets that are "close enough"
aka "fuzzy join"
Requires generation of "tokens" from record field(s)
Typically words from text
Simple implementation has three phases
First calculate counts for each unique token value
Then output <token, record> for N most common tokens of each record
Group by token, compare records in each group
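The three phases can be sketched in memory as below. This follows the simple description above rather than the fuzzyjoin project's code. The whitespace tokenizer, N = 2, the 0.5 threshold, and the sample records are all assumptions, and a `seen` set is added so that no pair is compared twice.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    return set(text.lower().split())


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def fuzzy_join(records, n=2, threshold=0.5):
    tokens = {rid: tokenize(text) for rid, text in records.items()}

    # Phase 1: count how many records contain each unique token.
    counts = Counter(t for toks in tokens.values() for t in toks)

    # Phase 2: emit <token, record> for the N most common tokens of each record.
    groups = defaultdict(list)
    for rid, toks in tokens.items():
        ranked = sorted(toks, key=lambda t: (-counts[t], t))
        for token in ranked[:n]:
            groups[token].append(rid)

    # Phase 3: group by token and compare the records in each group.
    seen, matches = set(), {}
    for members in groups.values():
        for pair in combinations(sorted(members), 2):
            if pair in seen:
                continue  # already compared via another shared token
            seen.add(pair)
            score = jaccard(tokens[pair[0]], tokens[pair[1]])
            if score >= threshold:
                matches[pair] = score
    return matches


records = {
    "r1": "220 3rd Avenue Seattle",
    "r2": "220 3rd Ave Seattle",
    "r3": "1 Main Street Portland",
}
matches = fuzzy_join(records)
print(matches)
```

In a Hadoop job each phase maps onto a grouping step, with the token acting as the key in phase 3.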
37
How does fuzzy join work?
For two records to be "similar enough"...
They need to share one of their common tokens
Generalization of the ZIP code "magic field" approach
Basic implementation has a number of issues
Passing around copies of full record is inefficient
Too-common tokens create huge groups for comparison
Two records compared multiple times
38
Summary
39
The Net-Net
Similarity is a common requirement for many applications
Recommendations
Entity matching
Combining Hadoop with search is a powerful combination
Scalability
Performance
Flexibility
40
Questions?
Feel free to contact me
http://www.scaleunlimited.com/contact/
Take a look at Pat Ferrel's Hadoop + Solr recommender
http://github.com/pferrel/solr-recommender
Check out Mahout
http://mahout.apache.org
Read paper & code for fuzzyjoin project
http://asterix.ics.uci.edu/fuzzyjoin/

โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Ongole Call-girls in Women Seeking Men  ๐Ÿ”Ongole๐Ÿ”   Escorts S...โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Ongole Call-girls in Women Seeking Men  ๐Ÿ”Ongole๐Ÿ”   Escorts S...
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Ongole Call-girls in Women Seeking Men ๐Ÿ”Ongole๐Ÿ” Escorts S...
ย 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
ย 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ย 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
ย 
Call Girls Bommasandra Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...
Call Girls Bommasandra Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...Call Girls Bommasandra Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...
Call Girls Bommasandra Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service B...
ย 
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป mahisagar Call-girls in Women Seeking Men ๐Ÿ”mahisagar๐Ÿ” Esc...
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป mahisagar Call-girls in Women Seeking Men  ๐Ÿ”mahisagar๐Ÿ”   Esc...โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป mahisagar Call-girls in Women Seeking Men  ๐Ÿ”mahisagar๐Ÿ”   Esc...
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป mahisagar Call-girls in Women Seeking Men ๐Ÿ”mahisagar๐Ÿ” Esc...
ย 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
ย 
Just Call Vip call girls Mysore Escorts โ˜Ž๏ธ9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts โ˜Ž๏ธ9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts โ˜Ž๏ธ9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts โ˜Ž๏ธ9352988975 Two shot with one girl (...
ย 
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Dindigul Call-girls in Women Seeking Men ๐Ÿ”Dindigul๐Ÿ” Escor...
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Dindigul Call-girls in Women Seeking Men  ๐Ÿ”Dindigul๐Ÿ”   Escor...โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Dindigul Call-girls in Women Seeking Men  ๐Ÿ”Dindigul๐Ÿ”   Escor...
โžฅ๐Ÿ” 7737669865 ๐Ÿ”โ–ป Dindigul Call-girls in Women Seeking Men ๐Ÿ”Dindigul๐Ÿ” Escor...
ย 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
ย 

Similarity at scale

  • 1. Copyright (c) 2014 Scale Unlimited. Similarity at Scale: Fuzzy matching and recommendations using Hadoop, Solr, and heuristics. Ken Krugler, Scale Unlimited.
  • 2. The Twitter Pitch. Wide class of problems that rely on "good" similarity: fast, accurate, scalable. Benefit from my mistakes. Scale Unlimited - consulting & training. Talking about solutions to real problems.
  • 3. What are similarity problems? Clustering (grouping similar advertisers). Deduplication (joining noisy sets of POI data). Recommendations (suggesting pages to users). Entity resolution (fuzzy matching of people and companies).
  • 4. What is "Similarity"? Exact matching is easy(er): accuracy is a given; fast and scalable can still be hard; lots of key/value systems like Cassandra, HBase, etc. Fuzzy matching is harder: two "things" aren't exactly the same; similarity is based on comparing features.
  • 5. Between two articles? Features could be a bag of words. Are these two articles the same? Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters." Article 2: "The inland is a geographically larger region and has a moderate continental climate, bookended by hot summers and cold and snowy winters."
  • 6. What about now? Easy to create challenging situations for a person, which is an impossible problem for a computer. Need to distinguish between "conceptually similar" and "derived from". Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters." Article 2: "Bosnia has a warm European climate, though the summers can be hot and the winters are often cold and wet."
  • 7. Between two records? Features could be field values. Are these two people the same? Name: Bob Bogus | Robert Bogus. Address: 220 3rd Avenue | 220 3rd Avenue. City: Seattle | Seattle. State: WA | WA. Zip: 98104-2608 | 98104.
  • 8. What about now? Need to get rid of false differences caused by abbreviations. How does a computer know what's a "significant" difference? Name: Bob Bogus | Robert H. Bogus. Address: Apt 102, 3220 3rd Ave | 220 3rd Avenue South. City: Seattle | Seattle. State: Washington | WA. Zip: 98104 (one record only).
  • 9. Between two users? Features could be items a user has bought. Are these two users the same? (Figure: User 1 vs. User 2.)
  • 10. What about now? Need more generic features, e.g. product categories. (Figure: User 1 vs. User 2.)
  • 11. How to measure similarity? Assuming you have some features for two "things", how does a program determine their degree of similarity? You want a number that represents their "closeness". Typically 1.0 means exactly the same, and 0.0 means completely different.
  • 12. Jaccard Coefficient. Ratio of number of items in common / total number of items, where "items" typically means unique values (sets of things). So 1.0 is exactly the same, and 0.0 is completely different.
  • 13. Cosine Similarity. Assume a document only has three unique words: cat, dog, goldfish. Set x = frequency of cat, y = frequency of dog, z = frequency of goldfish. The result is a "term vector" with 3 dimensions. Calculate the cosine of the angle between term vectors. This is their "cosine similarity".
  • 14. Why is scalability hard? Assume you have 8.5 million businesses in the US. There are ≈ N^2/2 pairs to evaluate. That's 36 trillion comparisons. Sometimes you can quickly trim this problem, e.g. if you assume the ZIP code exists, and must match. Then this becomes about 4 billion comparisons. But often you don't have a "magic" field.
  • 15. Copyright (c) 2011-2014 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. DataStax Web Site Page Recommender
  • 16. How to recommend pages? Besides manually adding a bunch of links... which is tedious, doesn't scale well, and gets busy.
  • 17. Can we exploit other users? Classic shopping cart analysis: "Users who bought X also bought Y". Based on actual activity, versus (noisy, skewed) ratings.
  • 18. What's the general approach? We have web logs with IP addresses, time, path to page, e.g. 157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1". A browsing session is a series of requests from one IP address, with some maximum time gap between requests. Find sessions "similar to" the current user's session. Recommend pages from these similar sessions.
  • 19. How to find similar sessions? Create a Lucene search index with one document per session. Each indexed document contains the page paths for one session, e.g. session-1: /path/to/page1, /path/to/page2, /path/to/page3; session-2: /path/to/pageX, /path/to/pageY. Search for paths from the current user's session.
  • 20. Why is this a search issue? Solr (search in general) is all about similarity: find documents similar to the words in my query. Cosine similarity is used to calculate similarity between the term vector for my query and the term vector of each document.
  • 21. What's the algorithm? Find sessions similar to the target (current user's) session. Calculate similarity between these sessions and the target session. Aggregate similarity scores for all paths from these sessions. Remove paths that are already in the target session. Recommend the highest scoring path(s).
  • 22. Why do you sum similarities? Give more weight to pages from sessions that are more similar. Pages from more similar sessions are assumed to be more interesting.
  • 23. What are some problems? The classic problem is that we recommend "common" pages. E.g. if you haven't viewed the top-level page in your session, but this page is very common in most of the other sessions, then it becomes one of the top recommended pages. But that generally stinks as a recommendation.
  • 24. Can RowSimilarityJob help? Part of the Mahout open source project. Takes as input a table of users (one per row) with lists of items. Generates an item-item co-occurrence matrix. Values are weights calculated using log-likelihood ratio (LLR). Unsurprising (common) items get low weights. If we run it on our data, where users = sessions and items = pages, we get a page-page co-occurrence matrix: Page 1/Page 2 = 2.1, Page 1/Page 3 = 0.8, Page 2/Page 3 = 4.5.
  • 25. How to use co-occurrence? Convert the matrix into an index: each row is one Lucene document. Drop any low-scoring entries. Create list of "related" pages. Search in Related Pages field, using pages from current session. So Page 2 recommends Page 1 & 3. Matrix: Page 1/Page 2 = 2.1, Page 1/Page 3 = 0.8, Page 2/Page 3 = 4.5. Related Pages: Page 1 -> Page 2; Page 2 -> Page 1, Page 3; Page 3 -> Page 2.
  • 26. EWS Entity Resolution
  • 27. What is Early Warning? Early Warning helps banks fight fraud. It's owned by the top 5 US banks, and gets data from 800+ financial institutions, so they have details on most US bank accounts. When somebody signs up for an account, they need to quickly match the person to "known entities" and derive a risk score based on related account details.
  • 28. Why do they need similarity? Assume you have information on 100s of millions of entities: name(s), address(es), phone number(s), etc., and often a unique ID (Social Security Number, EIN, etc). Why is this a similarity problem? Data is noisy - typos, abbreviations, partial data. People lie - much fraud starts with opening an account using bad data.
  • 29. How does search help? We can quickly build a list of candidate entities, using search. Query contains field data provided by the client bank. Significantly less than 1 second for 30 candidate entities. Then do more precise, sophisticated and CPU-intensive scoring. The end result is a ranked list of entities with similarity scores, which then is used to look up account status, fraud cases, etc.
  • 30. What's the data pipeline? Incoming data is cleaned up/normalized in Hadoop: simple things like space stripping, also phone number formatting, ZIP+4 expansion into just ZIP plus full. Other normalization happens inside of Solr. This gets loaded into Cassandra tables, and automatically indexed by Solr, via DataStax Enterprise. Example terms: ZIP+4 95014-2127 -> 95014, 2127; Phone 4805551212 -> 480, 5551212.
  • 31. What's the Solr setup? Each field in the index has very specific analysis: simple things like normalization; synonym expansion for names, abbreviations; split up fields so partial matches work. At query time we can weight the importance of each field, which helps order the top N candidates similar to their real match scores. E.g. an SSN matching means much more than a first name matching.
  • 32. Batch Similarity
  • 33. Can we do batch similarity? Search works well for real-time similarity, but batch processing at scale maxes out the search system. We can use two different techniques with Hadoop for batch: SimHash - good for text document similarity; Parallel Set-Similarity Joins - good for record similarity.
  • 34. What is SimHash? Assume a document is a set of (unique) words. Calculate a hash for each word. Probability that the minimum hash is the same for two documents... is magically equal to the Jaccard Coefficient. Term -> Hash: bosnia -> 78954874223; is -> 53466156768; the -> 5064199193; largest -> 3193621783; geographic -> -5718349925.
  • 35. What is a SimHash workflow? Calculate N hash values; an easy way is to use the N smallest hash values. Calculate number of matching hash values between doc pairs (M). Then the Jaccard Coefficient is ≈ M/N. Only works if N is much smaller than # of unique words in docs. Implementation of this in cascading.utils open source project: https://github.com/ScaleUnlimited/cascading.utils
  • 36. What is Set-Similarity Join? Joining records in two sets that are "close enough", aka "fuzzy join". Requires generation of "tokens" from record field(s), typically words from text. Simple implementation has three phases: first calculate counts for each unique token value; then output <token, record> for N most common tokens of each record; group by token, compare records in each group.
  • 37. How does fuzzy join work? For two records to be "similar enough"... they need to share one of their common tokens. Generalization of the ZIP code "magic field" approach. Basic implementation has a number of issues: passing around copies of full record is inefficient; too-common tokens create huge groups for comparison; two records compared multiple times.
  • 38. Summary
  • 39. The Net-Net. Similarity is a common requirement for many applications: recommendations, entity matching. Combining Hadoop with search is a powerful combination: scalability, performance, flexibility.
  • 40. Questions? Feel free to contact me: http://www.scaleunlimited.com/contact/ Take a look at Pat Ferrel's Hadoop + Solr recommender: http://github.com/pferrel/solr-recommender Check out Mahout: http://mahout.apache.org Read paper & code for fuzzyjoin project: http://asterix.ics.uci.edu/fuzzyjoin/
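
The two measures described on the Jaccard Coefficient and Cosine Similarity slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `jaccard` and `cosine` are hypothetical, and treating two empty sets as identical is an assumption made here.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Items in common divided by total unique items (both inputs treated as sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def cosine(a, b):
    """Cosine of the angle between the term-frequency vectors of two word lists."""
    tf_a, tf_b = Counter(a), Counter(b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc1 = ["cat", "dog", "dog"]
doc2 = ["dog", "goldfish"]
print(jaccard(doc1, doc2))  # one shared word out of three unique words
print(cosine(doc1, doc2))
```

Jaccard ignores how often a word occurs, while cosine similarity uses the frequencies, which is why the latter suits search-style scoring of term vectors.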

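The workflow on the SimHash slides (keep the N smallest word hashes per document, count the M hashes two documents share, estimate the Jaccard coefficient as M/N) can be sketched as below. The minimum-hash idea is more widely known as MinHash. This is a rough sketch under stated assumptions, not the cascading.utils implementation: `term_hash`, `signature`, and `estimate_jaccard` are hypothetical names, and MD5 is used only as a convenient stable hash.

```python
import hashlib


def term_hash(term):
    """Stable 64-bit hash of a word (Python's built-in hash() varies between runs)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(words, n):
    """The n smallest hash values over the document's unique words."""
    hashes = sorted({term_hash(w) for w in words})
    return set(hashes[:n])


def estimate_jaccard(sig_a, sig_b, n):
    """M / N, where M is the number of hash values the two signatures share."""
    return len(sig_a & sig_b) / n


n = 4
doc_a = "bosnia is the largest geographic region of the modern state".split()
doc_b = "the inland is a geographically larger region".split()
print(estimate_jaccard(signature(doc_a, n), signature(doc_b, n), n))
```

As the slide notes, the estimate is only meaningful when N is much smaller than the number of unique words per document; the toy sentences here are far too short for the number to be trusted.
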
Editor's Notes

  1. In nine years of using Hadoop & Solr, I've made a lot of mistakes.
  2. Open source is filled with key/value systems. My goal in the next three slides isn't to give a lecture on similarity. Covered in lots of detail by books, papers, etc. Providing context for discussion on the real-world problems and solutions.
  3. This text comes from two different versions of the Wikipedia page on Bosnia & Herzegovina. We read it for meaning, and that's similar - but how would a computer decide these are "similar"?
  4. Looking at these two people, a person can say "yes, they're the same".
  5. Looking at these two people, a person can say "They're likely to be the same". Bob vs. Robert, missing middle initial, no apartment, typo in street number, abbreviations, missing zip, etc.
  6. Obviously a typical document can have thousands of unique words. So very high dimensionality for the term vector.
  7. This assumes symmetry - the score of A compared to B is the same as B compared to A.
  8. This is from a module in the DataStax Solr course. It uses real page-view data from the DataStax web site.
  9. Ted Dunning talks about this approach frequently. LLR is used to determine which co-occurrences are sufficiently anomalous to be of interest as indicators. Challenges in that RowSimilarityJob wants just integer ids for everything, so some back-and-forth conversion is needed. Pat Ferrel has a project that implements much of this approach.
  10. We used Hadoop to process the original logs. And we can use Hadoop to generate this co-occurrence matrix. Then we use Solr/Lucene to search for items to recommend.
  11. It's amazing how often something like a phone number, or even an SSN, gets entered incorrectly.
  12. Performance is mostly impacted by the complexity (# of fields) in the query. Typically a query is < 200ms.
  13. Essentially we're trying to mimic much of what the more sophisticated matching does, but without impacting search performance, within the constraints of Lucene/Solr.
  14. Really this is a generalization of the magic field approach. Trying to reduce the number of record-record comparisons.
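
The three-phase fuzzy join from the Set-Similarity Join slides, which the last note describes as a generalization of the "magic field" approach, might look roughly like this on a single machine. It is a sketch under assumptions, not the fuzzyjoin project's code: the names `tokens`, `fuzzy_join`, `n`, and `threshold` are hypothetical, whitespace tokenization is a simplification, and the sketch keys each record on its N rarest tokens (the usual choice in prefix filtering, which keeps groups small), whereas the slide's wording says "most common". A `seen` set addresses the slide's "two records compared multiple times" issue.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokens(text):
    """Assumption: lower-cased whitespace-separated words serve as tokens."""
    return set(text.lower().split())


def fuzzy_join(records, n=2, threshold=0.5):
    """records maps record id -> text; returns (id_a, id_b, jaccard) for close pairs."""
    token_sets = {rid: tokens(text) for rid, text in records.items()}

    # Phase 1: count how many records contain each unique token.
    counts = Counter(t for ts in token_sets.values() for t in ts)

    # Phase 2: emit <token, record> for the n rarest tokens of each record.
    groups = defaultdict(list)
    for rid, ts in token_sets.items():
        for t in sorted(ts, key=lambda tok: (counts[tok], tok))[:n]:
            groups[t].append(rid)

    # Phase 3: compare only records that landed in the same token group.
    seen, matches = set(), []
    for ids in groups.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in seen:
                continue  # pair already compared via another shared token
            seen.add((a, b))
            sa, sb = token_sets[a], token_sets[b]
            score = len(sa & sb) / len(sa | sb)
            if score >= threshold:
                matches.append((a, b, score))
    return matches


records = {
    1: "220 3rd Avenue Seattle WA",
    2: "220 3rd Ave Seattle WA",
    3: "15 Main Street Boston MA",
}
print(fuzzy_join(records))
```

Because only a few tokens per record are used as join keys, this is a heuristic: a small `n` reduces the number of record-record comparisons but can miss pairs that share only tokens outside their key sets.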