Similarity at Scale
Fuzzy matching and recommendations using Hadoop, Solr, and heuristics
Ken Krugler
Scale Unlimited
Copyright (c) 2014 Scale Unlimited.
The Twitter Pitch
Wide class of problems that rely on "good" similarity
Fast
Accurate
Scalable
Benefit from my mistakes
Scale Unlimited - consulting & training
Talking about solutions to real problems
What are similarity problems?
Clustering
Grouping similar advertisers
Deduplication
Joining noisy sets of POI data
Recommendations
Suggesting pages to users
Entity resolution
Fuzzy matching of people and companies
What is "Similarity"?
Exact matching is easy(er)
Accuracy is a given
Fast and scalable can still be hard
Lots of key/value systems like Cassandra, HBase, etc.
Fuzzy matching is harder
Two "things" aren't exactly the same
Similarity is based on comparing features
Between two articles?
Features could be a bag of words
Are these two articles the same?
Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters."

Article 2: "The inland is a geographically larger region and has a moderate continental climate, bookended by hot summers and cold and snowy winters."
What about now?
It's easy to create situations that are challenging even for a person
Which makes them effectively impossible for a computer
Need to distinguish between "conceptually similar" and "derived from"
Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters."

Article 2: "Bosnia has a warm European climate, though the summers can be hot and the winters are often cold and wet."
Between two records?
Features could be field values
Are these two people the same?
          Person 1         Person 2
Name      Bob Bogus        Robert Bogus
Address   220 3rd Avenue   220 3rd Avenue
City      Seattle          Seattle
State     WA               WA
Zip       98104-2608       98104
What about now?
Need to get rid of false differences caused by abbreviations
How does a computer know what's a "significant" difference?
          Person 1                Person 2
Name      Bob Bogus               Robert H. Bogus
Address   Apt 102, 3220 3rd Ave   220 3rd Avenue South
City      Seattle                 Seattle
State     Washington              WA
Zip       98104
Between two users?
Features could be...
Items a user has bought
Are these two users the same?
[Figure: the sets of items bought by User 1 and User 2]
What about now?
Need more generic features
E.g. product categories
[Figure: User 1 and User 2's items, grouped into product categories]
How to measure similarity?
Assuming you have some features for two "things"
How does a program determine their degree of similarity?
You want a number that represents their "closeness"
Typically 1.0 means exactly the same
And 0.0 means completely different
Jaccard Coefficient
Ratio of number of items in common / total number of items
Where "items" typically means unique values (sets of things)
So 1.0 is exactly the same, and 0.0 is completely different
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Cosine Similarity
Assume a document only has three unique words
cat, dog, goldfish
Set x = frequency of cat
Set y = frequency of dog
Set z = frequency of goldfish
The result is a "term vector" with 3 dimensions
Calculate cosine of angle between term vectors
This is their "cosine similarity"
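A matching Java sketch, with term vectors as plain double arrays (the [cat, dog, goldfish] frequencies are illustrative):

    public class CosineSimilarity {
        // cos(theta) = (v1 · v2) / (|v1| * |v2|)
        public static double similarity(double[] v1, double[] v2) {
            double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
            for (int i = 0; i < v1.length; i++) {
                dot += v1[i] * v2[i];
                norm1 += v1[i] * v1[i];
                norm2 += v2[i] * v2[i];
            }
            if (norm1 == 0.0 || norm2 == 0.0) {
                return 0.0;  // A document with no terms matches nothing.
            }
            return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
        }

        public static void main(String[] args) {
            double[] doc1 = {3, 1, 0};  // [cat, dog, goldfish] frequencies
            double[] doc2 = {2, 2, 1};
            System.out.println(similarity(doc1, doc2));  // ≈ 0.84
        }
    }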
Why is scalability hard?
Assume you have 8.5 million businesses in the US
There are ≈ N^2/2 pairs to evaluate
That's 36 trillion comparisons
Sometimes you can quickly trim this problem
E.g. if you assume the ZIP code exists, and must match
Then this becomes about 4 billion comparisons
But often you don't have a "magic" field
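The ZIP trick is just blocking: only compare pairs that share a blocking key. A hedged in-memory sketch (the Business record and the scoring hook are hypothetical):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ZipBlocking {
        record Business(String name, String zip) {}

        // Compare only pairs sharing a ZIP code, instead of all N^2/2 pairs.
        public static void comparePairs(List<Business> businesses) {
            Map<String, List<Business>> byZip = new HashMap<>();
            for (Business b : businesses) {
                byZip.computeIfAbsent(b.zip(), z -> new ArrayList<>()).add(b);
            }
            for (List<Business> block : byZip.values()) {
                for (int i = 0; i < block.size(); i++) {
                    for (int j = i + 1; j < block.size(); j++) {
                        // score(block.get(i), block.get(j)) would go here.
                    }
                }
            }
        }
    }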
DataStax Web Site Page Recommender
How to recommend pages?
Besides manually adding a bunch of links...
Which is tedious, doesn't scale well, and gets busy
Can we exploit other users?
Classic shopping cart analysis
"Users who bought X also bought Y"
Based on actual activity, versus (noisy, skewed) ratings
What's the general approach?
We have web logs with IP addresses, time, path to page
157.55.33.39 - - [18/Mar/2014:00:01:00 -0500]
"GET /solutions/nosql HTTP/1.1"
A browsing session is a series of requests from one IP address
With some maximum time gap between requests
Find sessions "similar to" the current user's session
Recommend pages from these similar sessions
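A minimal sessionization sketch, assuming the log lines are already parsed into (ip, timestamp, path) tuples and sorted by IP then time; the 30-minute gap is an illustrative choice:

    import java.util.ArrayList;
    import java.util.List;

    public class Sessionizer {
        record PageView(String ip, long timestampMs, String path) {}

        static final long MAX_GAP_MS = 30 * 60 * 1000;  // max gap between requests

        // Views must be pre-sorted by IP, then by timestamp.
        public static List<List<String>> sessionize(List<PageView> views) {
            List<List<String>> sessions = new ArrayList<>();
            List<String> current = null;
            PageView prev = null;
            for (PageView v : views) {
                boolean newSession = (prev == null)
                    || !prev.ip().equals(v.ip())
                    || (v.timestampMs() - prev.timestampMs() > MAX_GAP_MS);
                if (newSession) {
                    current = new ArrayList<>();
                    sessions.add(current);
                }
                current.add(v.path());
                prev = v;
            }
            return sessions;
        }
    }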
How to find similar sessions?
Create a Lucene search index with one document per session
Each indexed document contains the page paths for one session
session-1 /path/to/page1, /path/to/page2, /path/to/page3
session-2 /path/to/pageX, /path/to/pageY
Search for paths from the current user's session
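A sketch of the indexing side against a Lucene 5.x-era API (the index directory and field names are illustrative); joining paths with spaces and using the whitespace analyzer keeps each path as a single term:

    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class SessionIndexer {
        // One Lucene document per session.
        public static void index(List<List<String>> sessions) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new WhitespaceAnalyzer());
            try (IndexWriter writer =
                     new IndexWriter(FSDirectory.open(Paths.get("sessions-index")), config)) {
                int id = 0;
                for (List<String> session : sessions) {
                    Document doc = new Document();
                    doc.add(new StringField("id", "session-" + (++id), Field.Store.YES));
                    doc.add(new TextField("paths", String.join(" ", session), Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }
    }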
Why is this a search issue?
Solr (search in general) is all about similarity
Find documents similar to the words in my query
Cosine similarity is used to calculate similarity
Between the term vector for my query
and the term vector of each document
What's the algorithm?
Find sessions similar to the target (current user's) session
Calculate similarity between these sessions and the target session
Aggregate similarity scores for all paths from these sessions
Remove paths that are already in the target session
Recommend the highest scoring path(s)
Why do you sum similarities?
Give more weight to pages from sessions that are more similar
Pages from more similar sessions are assumed to be more interesting
[Figure: Venn diagrams over pages A-F. Session 1 vs. Target Session: Jaccard = 0.4 (2 / 5). Session 2 vs. Target Session: Jaccard = 0.2 (1 / 5). Resulting page scores: D = 0.6 (0.4 + 0.2), E = 0.4, F = 0.2]
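A sketch of the aggregation, reproducing the figure's numbers; the similarity scores come from whatever measure you used (e.g. Jaccard):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class PathScorer {
        // Sum each similar session's score into every page it contains,
        // skipping pages the target session has already seen.
        public static Map<String, Double> score(List<Set<String>> sessions,
                                                List<Double> similarities,
                                                Set<String> targetSession) {
            Map<String, Double> pathScores = new HashMap<>();
            for (int i = 0; i < sessions.size(); i++) {
                for (String path : sessions.get(i)) {
                    if (!targetSession.contains(path)) {
                        pathScores.merge(path, similarities.get(i), Double::sum);
                    }
                }
            }
            return pathScores;  // Sort by score descending to pick recommendations.
        }

        public static void main(String[] args) {
            Set<String> target = Set.of("A", "B", "C");
            List<Set<String>> sessions = List.of(
                Set.of("A", "B", "D", "E"),   // similarity 0.4 vs. target
                Set.of("A", "D", "F"));       // similarity 0.2 vs. target
            // Prints {D=0.6..., E=0.4, F=0.2}
            System.out.println(score(sessions, List.of(0.4, 0.2), target));
        }
    }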
What are some problems?
The classic problem is that we recommend "common" pages
E.g. if you haven't viewed the top-level page in your session
But this page is very common in most of the other sessions
So then it becomes one of the top recommended pages
But that generally stinks as a recommendation
Can RowSimilarityJob help?
Part of the Mahout open source project
Takes as input a table of users (one per row) with lists of items
Generates an item-item co-occurrence matrix
Values are weights calculated using log-likelihood ratio (LLR)
Unsurprising (common) items get low weights
If we run it on our data, where users = sessions and items = pages
We get a page-page co-occurrence matrix
         Page 1   Page 2   Page 3
Page 1     -       2.1      0.8
Page 2    2.1       -       4.5
Page 3    0.8      4.5       -
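The LLR weighting itself is a 2x2 contingency-table test. A standalone sketch, patterned after Mahout's LogLikelihood class (counts here are sessions, since users = sessions):

    public class Llr {
        // k11: sessions with both pages, k12: page A only,
        // k21: page B only, k22: neither page.
        public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
            double rowEntropy = entropy(k11 + k12, k21 + k22);
            double colEntropy = entropy(k11 + k21, k12 + k22);
            double matEntropy = entropy(k11, k12, k21, k22);
            return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
        }

        private static double entropy(long... counts) {
            long total = 0;
            double sum = 0.0;
            for (long k : counts) {
                total += k;
                sum += xLogX(k);
            }
            return xLogX(total) - sum;
        }

        private static double xLogX(long x) {
            return x == 0 ? 0.0 : x * Math.log(x);
        }
    }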
How to use co-occurrence?
Convert the matrix into an index
Each row is one Lucene document
Drop any low-scoring entries
Create list of "related" pages
Search in Related Pages field
Using pages from current session
So Page 2 recommends Page 1 & 3
Co-occurrence matrix:

         Page 1   Page 2   Page 3
Page 1     -       2.1      0.8
Page 2    2.1       -       4.5
Page 3    0.8      4.5       -

Index (after dropping the low-scoring 0.8 entries):

         Related Pages
Page 1   Page 2
Page 2   Page 1, Page 3
Page 3   Page 2
EWS Entity Resolution
What is Early Warning?
Early Warning helps banks fight fraud
It's owned by the top 5 US banks
And gets data from 800+ financial institutions
So they have details on most US bank accounts
When somebody signs up for an account
They need to quickly match the person to "known entities"
And derive a risk score based on related account details
Why do they need similarity?
Assume you have information on 100s of millions of entities
Name(s), address(es), phone number(s), etc.
And often a unique ID (Social Security Number, EIN, etc)
Why is this a similarity problem?
Data is noisy - typos, abbreviations, partial data
People lie - much fraud starts with opening an account using bad data
How does search help?
We can quickly build a list of candidate entities, using search
Query contains field data provided by the client bank
Significantly less than 1 second for 30 candidate entities
Then do more precise, sophisticated and CPU-intensive scoring
The end result is a ranked list of entities with similarity scores
Which then is used to look up account status, fraud cases, etc.
What's the data pipeline?
Incoming data is cleaned up/normalized in Hadoop
Simple things like space stripping
Also phone number formatting
ZIP+4 expansion into the base ZIP plus the +4 extension
Other normalization happens inside of Solr
This gets loaded into Cassandra tables
And automatically indexed by Solr, via DataStax Enterprise
ZIP+4        Terms
95014-2127   95014, 2127

Phone        Terms
4805551212   480, 5551212
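A hedged sketch of those two normalizations (method names are illustrative):

    public class Normalizer {
        // "95014-2127" -> "95014 2127": base ZIP plus the +4 part as separate terms.
        public static String zipTerms(String zip) {
            return zip.replace("-", " ").trim();
        }

        // "4805551212" -> "480 5551212": area code plus local number,
        // assuming a 10-digit North American number.
        public static String phoneTerms(String phone) {
            String digits = phone.replaceAll("\\D", "");
            if (digits.length() != 10) {
                return digits;  // Leave anything unexpected as a single term.
            }
            return digits.substring(0, 3) + " " + digits.substring(3);
        }
    }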
What's the Solr setup?
Each field in the index has very specific analysis
Simple things like normalization
Synonym expansion for names, abbreviations
Split up fields so partial matches work
At query time we can weight the importance of each field
Which helps order the top N candidates similarly to their real match scores
E.g. an SSN matching means much more than a first name matching
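In standard Lucene/Solr query syntax that weighting is just per-field boosts; the field names and boost values below are made up for illustration:

    first_name:robert^2 last_name:bogus^5 zip:98104^3 ssn:123456789^20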
Batch Similarity
Can we do batch similarity?
Search works well for real-time similarity
But batch processing at scale maxes out the search system
We can use two different techniques with Hadoop for batch
SimHash - good for text document similarity
Parallel Set-Similarity Joins - good for record similarity
What is SimHash?
Assume a document is a set of (unique) words
Calculate a hash for each word
Probability that the minimum hash is the same for two documents...
...is magically equal to the Jaccard Coefficient
Term         Hash
bosnia       78954874223
is           53466156768
the          5064199193
largest      3193621783
geographic   -5718349925
What is a SimHash workflow?
Calculate N hash values
An easy way is to use the N smallest hash values
Calculate number of matching hash values between doc pairs (M)
Then the Jaccard Coefficient is ≈ M/N
Only works if N is much smaller than # of unique words in docs
There's an implementation of this in the cascading.utils open source project
https://github.com/ScaleUnlimited/cascading.utils
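A minimal in-memory sketch of the idea (not the cascading.utils implementation); word.hashCode() stands in for a real hash function:

    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Collectors;

    public class MinHashSketch {
        // Signature = the N smallest word hashes in the document.
        public static Set<Long> signature(Set<String> words, int n) {
            TreeSet<Long> hashes = new TreeSet<>();
            for (String word : words) {
                hashes.add((long) word.hashCode());  // Use a stronger hash in practice.
            }
            return hashes.stream().limit(n)
                .collect(Collectors.toCollection(TreeSet::new));
        }

        // Per the slide: Jaccard ≈ M/N, where M = matching hash values.
        public static double estimateJaccard(Set<Long> sig1, Set<Long> sig2, int n) {
            TreeSet<Long> matches = new TreeSet<>(sig1);
            matches.retainAll(sig2);
            return (double) matches.size() / n;
        }
    }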
What is Set-Similarity Join?
Joining records in two sets that are "close enough"
aka "fuzzy join"
Requires generation of "tokens" from record field(s)
Typically words from text
Simple implementation has three phases
First calculate counts for each unique token value
Then output <token, record> for N most common tokens of each record
Group by token, compare records in each group
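An in-memory sketch of the three phases (on Hadoop each phase would be its own job; the Rec record is hypothetical):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class FuzzyJoinSketch {
        record Rec(int id, Set<String> tokens) {}

        public static Set<String> candidatePairs(List<Rec> records, int n) {
            // Phase 1: global token counts.
            Map<String, Integer> counts = new HashMap<>();
            for (Rec r : records) {
                for (String t : r.tokens()) counts.merge(t, 1, Integer::sum);
            }
            // Phase 2: emit <token, record> for each record's N most common tokens.
            Map<String, List<Rec>> groups = new HashMap<>();
            for (Rec r : records) {
                r.tokens().stream()
                    .sorted(Comparator.comparing(counts::get).reversed())
                    .limit(n)
                    .forEach(t -> groups.computeIfAbsent(t, k -> new ArrayList<>()).add(r));
            }
            // Phase 3: group by token, compare records within each group.
            Set<String> pairs = new HashSet<>();
            for (List<Rec> group : groups.values()) {
                for (int i = 0; i < group.size(); i++) {
                    for (int j = i + 1; j < group.size(); j++) {
                        pairs.add(group.get(i).id() + "-" + group.get(j).id());
                    }
                }
            }
            return pairs;  // Score each pair (e.g. Jaccard) to finish the join.
        }
    }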
How does fuzzy join work?
For two records to be "similar enough"...
They need to share one of their common tokens
Generalization of the ZIP code "magic field" approach
Basic implementation has a number of issues
Passing around copies of full record is inefficient
Too-common tokens create huge groups for comparison
Two records compared multiple times
Summary
The Net-Net
Similarity is a common requirement for many applications
Recommendations
Entity matching
Combining Hadoop with search is a powerful combination
Scalability
Performance
Flexibility
Questions?
Feel free to contact me
http://www.scaleunlimited.com/contact/
Take a look at Pat Ferrel's Hadoop + Solr recommender
http://github.com/pferrel/solr-recommender
Check out Mahout
http://mahout.apache.org
Read paper & code for fuzzyjoin project
http://asterix.ics.uci.edu/fuzzyjoin/