Similarity at Scale

  1. Similarity at Scale
     Fuzzy matching and recommendations using Hadoop, Solr, and heuristics
     Ken Krugler, Scale Unlimited
     Copyright (c) 2014 Scale Unlimited.
  2. The Twitter Pitch
     Wide class of problems that rely on "good" similarity
     Fast
     Accurate
     Scalable
     Benefit from my mistakes
     Scale Unlimited - consulting & training
     Talking about solutions to real problems
  3. What are similarity problems?
     Clustering: grouping similar advertisers
     Deduplication: joining noisy sets of POI data
     Recommendations: suggesting pages to users
     Entity resolution: fuzzy matching of people and companies
  4. What is "Similarity"?
     Exact matching is easy(er)
     Accuracy is a given
     Fast and scalable can still be hard
     Lots of key/value systems like Cassandra, HBase, etc.
     Fuzzy matching is harder
     Two "things" aren't exactly the same
     Similarity is based on comparing features
  5. Between two articles?
     Features could be a bag of words
     Are these two articles the same?
     Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters."
     Article 2: "The inland is a geographically larger region and has a moderate continental climate, bookended by hot summers and cold and snowy winters."
  6. What about now?
     Easy to create challenging situations for a person
     Which is an impossible problem for a computer
     Need to distinguish between "conceptually similar" and "derived from"
     Article 1: "Bosnia is the largest geographic region of the modern state with a moderate continental climate, marked by hot summers and cold, snowy winters."
     Article 2: "Bosnia has a warm European climate, though the summers can be hot and the winters are often cold and wet."
  7. Between two records?
     Features could be field values
     Are these two people the same?
     Field    Record 1        Record 2
     Name     Bob Bogus       Robert Bogus
     Address  220 3rd Avenue  220 3rd Avenue
     City     Seattle         Seattle
     State    WA              WA
     Zip      98104-2608      98104
  8. What about now?
     Need to get rid of false differences caused by abbreviations
     How does a computer know what's a "significant" difference?
     Name Address City State Zip Bob Bogus Robert H. Bogus Apt 102, 3220 3rd Ave 220 3rd Avenue South Seattle Seattle Washington WA 98104
  9. Between two users?
     Features could be...
     Items a user has bought
     Are these two users the same?
     (Figure: User 1 vs. User 2)
  10. What about now?
      Need more generic features
      E.g. product categories
      (Figure: User 1 vs. User 2)
  11. How to measure similarity?
      Assuming you have some features for two "things"
      How does a program determine their degree of similarity?
      You want a number that represents their "closeness"
      Typically 1.0 means exactly the same
      And 0.0 means completely different
  12. Jaccard Coefficient
      Ratio of number of items in common / total number of items
      Where "items" typically means unique values (sets of things)
      So 1.0 is exactly the same, and 0.0 is completely different
      Jaccard(A, B) = |A ∩ B| / |A ∪ B|
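The ratio on this slide can be written directly with Python sets. This is a minimal sketch: the function name and the sample word lists are illustrative assumptions, and treating two empty sets as identical is a convention chosen here, not something the slide states.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets of unique items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical bag-of-words features for two short texts.
doc1 = "hot summers and cold snowy winters".split()
doc2 = "hot summers and cold wet winters".split()
print(jaccard(doc1, doc2))  # 5 shared words out of 7 unique words overall
```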
  13. Cosine Similarity
      Assume a document only has three unique words
      cat, dog, goldfish
      Set x = frequency of cat
      Set y = frequency of dog
      Set z = frequency of goldfish
      The result is a "term vector" with 3 dimensions
      Calculate cosine of angle between term vectors
      This is their "cosine similarity"
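As a rough illustration of the term-vector idea, the sketch below builds frequency vectors with `collections.Counter` and computes the cosine of the angle between them. The function name and the two sample documents are assumptions made for this example; real search engines add weighting (such as TF-IDF) that is not shown here.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the term-frequency vectors of two word lists."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Term vectors (cat, dog, goldfish) = (2, 1, 0) and (1, 2, 1).
doc_a = ["cat", "cat", "dog"]
doc_b = ["cat", "dog", "dog", "goldfish"]
print(cosine_similarity(doc_a, doc_b))  # 4 / sqrt(5 * 6)
```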
  14. Why is scalability hard?
      Assume you have 8.5 million businesses in the US
      There are ≈ N^2/2 pairs to evaluate
      That's 36 trillion comparisons
      Sometimes you can quickly trim this problem
      E.g. if you assume the ZIP code exists, and must match
      Then this becomes about 4 billion comparisons
      But often you don't have a "magic" field
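The arithmetic and the "magic field" trick can be sketched as follows. The exact pair count is N(N-1)/2, which is about 36 trillion for 8.5 million records. The `blocked_pairs` helper and the record layout (a dict with a `zip` key) are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Yield only the pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)


print(pair_count(8_500_000))  # roughly 36 trillion

# Hypothetical records: only the two Seattle businesses get compared.
businesses = [
    {"name": "a", "zip": "98104"},
    {"name": "b", "zip": "98104"},
    {"name": "c", "zip": "95014"},
]
print(list(blocked_pairs(businesses, key=lambda r: r["zip"])))
```

How much blocking helps depends on how evenly the records spread across key values, so the reduction for a real data set has to be measured rather than assumed.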
  15. DataStax Web Site Page Recommender
      Copyright (c) 2011-2014 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
  16. How to recommend pages?
      Besides manually adding a bunch of links...
      Which is tedious, doesn't scale well, and gets busy
  17. Can we exploit other users?
      Classic shopping cart analysis
      "Users who bought X also bought Y"
      Based on actual activity, versus (noisy, skewed) ratings
  18. What's the general approach?
      We have web logs with IP addresses, time, path to page
      157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"
      A browsing session is a series of requests from one IP address
      With some maximum time gap between requests
      Find sessions "similar to" the current user's session
      Recommend pages from these similar sessions
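A minimal sketch of turning such log lines into sessions is shown below. The regular expression only covers the fields visible in the sample line, the 30-minute gap is an assumed value (the slide only says "some maximum time gap"), and every path other than `/solutions/nosql` is made up for the example.

```python
import re
from datetime import datetime, timedelta

# Matches: IP, two ignored fields, [timestamp], "METHOD path ..."
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)')


def parse_line(line):
    """Return (ip, timestamp, path) for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, stamp, path = match.groups()
    return ip, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"), path


def sessionize(lines, max_gap=timedelta(minutes=30)):
    """Group requests by IP, starting a new session after a gap longer than max_gap."""
    requests = sorted(filter(None, map(parse_line, lines)), key=lambda r: (r[0], r[1]))
    sessions = []
    last_ip = last_time = None
    for ip, when, path in requests:
        if ip != last_ip or when - last_time > max_gap:
            sessions.append([])
        sessions[-1].append(path)
        last_ip, last_time = ip, when
    return sessions


log_lines = [
    '157.55.33.39 - - [18/Mar/2014:00:01:00 -0500] "GET /solutions/nosql HTTP/1.1"',
    '157.55.33.39 - - [18/Mar/2014:00:05:00 -0500] "GET /docs HTTP/1.1"',      # hypothetical
    '157.55.33.39 - - [18/Mar/2014:02:00:00 -0500] "GET /download HTTP/1.1"',  # hypothetical
]
print(sessionize(log_lines))
```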
  19. How to find similar sessions?
      Create a Lucene search index with one document per session
      Each indexed document contains the page paths for one session
      session-1: /path/to/page1, /path/to/page2, /path/to/page3
      session-2: /path/to/pageX, /path/to/pageY
      Search for paths from the current user's session
  20. Why is this a search issue?
      Solr (search in general) is all about similarity
      Find documents similar to the words in my query
      Cosine similarity is used to calculate similarity
      Between the term vector for my query and the term vector of each document
  21. What's the algorithm?
      Find sessions similar to the target (current user's) session
      Calculate similarity between these sessions and the target session
      Aggregate similarity scores for all paths from these sessions
      Remove paths that are already in the target session
      Recommend the highest scoring path(s)
  22. Why do you sum similarities?
      Give more weight to pages from sessions that are more similar
      Pages from more similar sessions are assumed to be more interesting
      (Figure: Venn diagrams of each session against the target session)
      Session 1 vs Target Session (pages A, B, C, D, E): Jaccard = 0.4 (2 / 5)
      Session 2 vs Target Session (pages A, B, C, D, F): Jaccard = 0.2 (1 / 5)
      Page  Score
      D     0.6 (0.4 + 0.2)
      E     0.4
      F     0.2
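The five steps and the summed scores can be sketched in a few lines. This is not the Solr-based implementation from the talk: it compares the target against every session with a Jaccard score instead of using a search index, and the session contents below are an assumed reconstruction chosen to reproduce the numbers on the slide (0.4, 0.2, and page scores 0.6, 0.4, 0.2).

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard coefficient of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def recommend(target, sessions, top_n=3):
    """Score unseen pages by summing the similarity of each session they appear in."""
    target = set(target)
    scores = defaultdict(float)
    for session in sessions:
        session = set(session)
        similarity = jaccard(target, session)
        if similarity == 0.0:
            continue  # ignore sessions with nothing in common
        for page in session - target:  # skip pages already in the target session
            scores[page] += similarity
    # Highest score first; ties broken by page name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Assumed session contents consistent with the slide's Venn diagrams.
target_session = {"A", "B", "C"}
session_1 = {"B", "C", "D", "E"}  # shares 2 of 5 pages with the target
session_2 = {"C", "D", "F"}       # shares 1 of 5 pages with the target
print(recommend(target_session, [session_1, session_2]))
```

Page D ranks first because it appears in both similar sessions, so it collects both similarity scores.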
  27. What are some problems?
      The classic problem is that we recommend "common" pages
      E.g. if you haven't viewed the top-level page in your session
      But this page is very common in most of the other sessions
      So then it becomes one of the top recommended pages
      But that generally stinks as a recommendation
  35. Can RowSimilarityJob help?
      Part of the Mahout open source project
      Takes as input a table of users (one per row) with lists of items
      Generates an item-item co-occurrence matrix
      Values are weights calculated using log-likelihood ratio (LLR)
      Unsurprising (common) items get low weights
      If we run it on our data, where users = sessions and items = pages
      We get a page-page co-occurrence matrix
              Page 1  Page 2  Page 3
      Page 1          2.1     0.8
      Page 2  2.1             4.5
      Page 3  0.8     4.5
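To see why common pages get low weights, here is a sketch of the standard log-likelihood ratio (Dunning's G² statistic) over a 2x2 table of session counts. It is written from the textbook entropy formulation and is not a copy of Mahout's code; the counts in the example are invented.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term used by the LLR formula."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: sessions with both pages      k12: sessions with page A only
    k21: sessions with page B only     k22: sessions with neither page
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    matrix = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - matrix))


# Invented counts over 100 sessions. Page A is in 10 sessions, the home page in 90.
# They co-occur 9 times, which is exactly what chance predicts, so the weight is ~0.
print(llr(9, 1, 81, 9))
# Two pages that always appear together (10 sessions each) get a high weight.
print(llr(10, 0, 0, 90))
```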
  44. How to use co-occurrence?
      Convert the matrix into an index
      Each row is one Lucene document
      Drop any low-scoring entries
      Create list of "related" pages
      Search in Related Pages field
      Using pages from current session
      So Page 2 recommends Page 1 & 3
              Page 1  Page 2  Page 3
      Page 1          2.1     0.8
      Page 2  2.1             4.5
      Page 3  0.8     4.5
      Page    Related Pages
      Page 1  Page 2
      Page 2  Page 1, Page 3
      Page 3  Page 2
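The matrix-to-index step can be imitated with plain dictionaries. In this sketch the "index" is an in-memory dict rather than Lucene, hits are ranked by a simple match count instead of a relevance score, and the cutoff of 1.0 is an assumed value (the slide does not give the threshold; any value between 0.8 and 2.1 yields the Related Pages table shown).

```python
from collections import defaultdict

# Upper triangle of the slide's page-page matrix.
cooccurrence = {
    ("Page 1", "Page 2"): 2.1,
    ("Page 1", "Page 3"): 0.8,
    ("Page 2", "Page 3"): 4.5,
}


def build_related_index(weights, min_weight):
    """One 'document' per page, holding the pages whose weight passes the cutoff."""
    related = defaultdict(set)
    for (page_a, page_b), weight in weights.items():
        if weight >= min_weight:  # drop low-scoring entries
            related[page_a].add(page_b)
            related[page_b].add(page_a)
    return related


def recommend(index, session_pages):
    """Find pages whose related-pages field matches pages from the current session."""
    session = set(session_pages)
    hits = {
        page: len(related & session)
        for page, related in index.items()
        if page not in session and related & session
    }
    return sorted(hits, key=lambda page: (-hits[page], page))


index = build_related_index(cooccurrence, min_weight=1.0)  # assumed cutoff
print(recommend(index, ["Page 2"]))
```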
  45. EWS Entity Resolution
  46. What is Early Warning?
      Early Warning helps banks fight fraud
      It's owned by the top 5 US banks
      And gets data from 800+ financial institutions
      So they have details on most US bank accounts
      When somebody signs up for an account
      They need to quickly match the person to "known entities"
      And derive a risk score based on related account details
  47. Why do they need similarity?
      Assume you have information on 100s of millions of entities
      Name(s), address(es), phone number(s), etc.
      And often a unique ID (Social Security Number, EIN, etc.)
      Why is this a similarity problem?
      Data is noisy - typos, abbreviations, partial data
      People lie - much fraud starts with opening an account using bad data
  48. How does search help?
      We can quickly build a list of candidate entities, using search
      Query contains field data provided by the client bank
      Significantly less than 1 second for 30 candidate entities
      Then do more precise, sophisticated and CPU-intensive scoring
      The end result is a ranked list of entities with similarity scores
      Which then is used to look up account status, fraud cases, etc.
  49. What's the data pipeline?
      Incoming data is cleaned up/normalized in Hadoop
      Simple things like space stripping
      Also phone number formatting
      ZIP+4 expansion into just ZIP plus full
      Other normalization happens inside of Solr
      This gets loaded into Cassandra tables
      And automatically indexed by Solr, via DataStax Enterprise
      ZIP+4       Terms
      95014-2127  95014, 2127
      Phone       Terms
      4805551212  480, 5551212
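The two term tables suggest normalizers along these lines. This is only a sketch of the splitting shown on the slide, written as plain Python functions rather than Hadoop jobs; the function names, the handling of punctuation, and the stripping of a leading US country code are assumptions.

```python
import re


def zip_terms(raw):
    """Split a ZIP or ZIP+4 into terms, e.g. '95014-2127' -> ['95014', '2127']."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 9:
        return [digits[:5], digits[5:]]
    if len(digits) == 5:
        return [digits]
    return []  # not a recognizable ZIP code


def phone_terms(raw):
    """Split a 10-digit US phone number into area code and local number."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumption: drop a leading country code
    if len(digits) != 10:
        return []
    return [digits[:3], digits[3:]]


print(zip_terms("95014-2127"))
print(phone_terms("4805551212"))
```

Splitting fields this way lets a record with only the 5-digit ZIP, or a phone number missing its area code, still match partially.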
  50. What's the Solr setup?
      Each field in the index has very specific analysis
      Simple things like normalization
      Synonym expansion for names, abbreviations
      Split up fields so partial matches work
      At query time we can weight the importance of each field
      Which helps order the top N candidates similar to their real match scores
      E.g. an SSN matching means much more than a first name matching
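One common way to express query-time field weights in Solr is the `qf` parameter of the Extended DisMax query parser, where each field carries a boost such as `ssn^10`. The sketch below only builds the request parameters; the field names, the boost values, and the choice of edismax are assumptions for illustration, since the slides do not show the actual queries.

```python
def candidate_query(record, field_boosts, rows=30):
    """Build Solr request parameters for a boosted candidate search.

    record: field values supplied with the request (hypothetical field names)
    field_boosts: field name -> boost; a higher boost makes a match count for more
    """
    terms = " ".join(str(value) for value in record.values() if value)
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": terms, "qf": qf, "rows": rows}


# Hypothetical boosts: an SSN match matters far more than a first-name match.
boosts = {"ssn": 10, "last_name": 3, "first_name": 1}
applicant = {"ssn": "078051120", "last_name": "Bogus", "first_name": "Bob"}
print(candidate_query(applicant, boosts))
```

A real client would also need to escape query syntax characters in the field values before sending them.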
  51. Batch Similarity
  52. Can we do batch similarity?
      Search works well for real-time similarity
      But batch processing at scale maxes out the search system
      We can use two different techniques with Hadoop for batch
      SimHash - good for text document similarity
      Parallel Set-Similarity Joins - good for record similarity
  53. What is SimHash?
      Assume a document is a set of (unique) words
      Calculate a hash for each word
      Probability that the minimum hash is the same for two documents...
      ...is magically equal to the Jaccard Coefficient
      Term        Hash
      bosnia      78954874223
      is          53466156768
      the         5064199193
      largest     3193621783
      geographic  -5718349925
  54. What is a SimHash workflow?
      Calculate N hash values
      Easy way is to use the N smallest hash values
      Calculate number of matching hash values between doc pairs (M)
      Then the Jaccard Coefficient is ≈ M/N
      Only works if N is much smaller than # of unique words in docs
      Implementation of this in cascading.utils open source project
      https://github.com/ScaleUnlimited/cascading.utils
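The workflow above can be sketched with the standard library. The minimum-hash property these two slides describe is the one usually referred to as MinHash; the code follows the slides' simple version (keep the N smallest hash values, estimate Jaccard as M/N) and is not taken from cascading.utils. The hash function (first 8 bytes of SHA-1) and the sample documents are assumptions.

```python
import hashlib


def word_hash(word):
    """Deterministic 64-bit hash of a word (assumed choice: first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(words, n):
    """The n smallest hash values over the document's unique words."""
    return set(sorted({word_hash(w) for w in words})[:n])


def estimate_jaccard(sig_a, sig_b, n):
    """Approximate Jaccard as M / N, where M is the number of shared hash values."""
    return len(sig_a & sig_b) / n


# Synthetic documents: 100 unique words each, 60 of them shared (true Jaccard = 60/140).
doc_a = [f"w{i}" for i in range(100)]
doc_b = [f"w{i}" for i in range(40, 140)]
n = 20
print(estimate_jaccard(signature(doc_a, n), signature(doc_b, n), n))
```

The estimate is noisy for small N, and it degrades when a document has fewer than N unique words, which matches the caveat on the slide.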
  55. What is Set-Similarity Join?
      Joining records in two sets that are "close enough"
      aka "fuzzy join"
      Requires generation of "tokens" from record field(s)
      Typically words from text
      Simple implementation has three phases
      First calculate counts for each unique token value
      Then output <token, record> for N most common tokens of each record
      Group by token, compare records in each group
  56. How does fuzzy join work?
      For two records to be "similar enough"...
      They need to share one of their common tokens
      Generalization of the ZIP code "magic field" approach
      Basic implementation has a number of issues
      Passing around copies of full record is inefficient
      Too-common tokens create huge groups for comparison
      Two records compared multiple times
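The three phases can be sketched in memory as below. This is a single-machine illustration, not the Hadoop implementation or the fuzzyjoin project's code: it joins one list against itself, passes record ids instead of full records, tracks pairs already compared, and uses a Jaccard threshold of 0.5 and N = 2 tokens as assumed values. Token selection follows the slide's description (the N most common tokens of each record).

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(record):
    """Tokens are the lowercased, whitespace-separated words of the record."""
    return set(record.lower().split())


def fuzzy_join(records, n_tokens=2, threshold=0.5):
    token_sets = [tokenize(r) for r in records]

    # Phase 1: count how many records contain each token.
    counts = Counter(t for tokens in token_sets for t in tokens)

    # Phase 2: emit <token, record id> for the N most common tokens of each record.
    groups = defaultdict(list)
    for rec_id, tokens in enumerate(token_sets):
        ordered = sorted(tokens, key=lambda t: (-counts[t], t))
        for token in ordered[:n_tokens]:
            groups[token].append(rec_id)

    # Phase 3: compare records within each group, skipping pairs already seen.
    seen, matches = set(), []
    for ids in groups.values():
        for a, b in combinations(ids, 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            union = token_sets[a] | token_sets[b]
            similarity = len(token_sets[a] & token_sets[b]) / len(union)
            if similarity >= threshold:
                matches.append((a, b, similarity))
    return sorted(matches)


# Hypothetical business records.
records = [
    "Acme Coffee Shop Seattle",
    "ACME Coffee Seattle",
    "Bob's Auto Repair Portland",
]
print(fuzzy_join(records))
```

Passing ids and tracking seen pairs address two of the issues listed on the slide; oversized groups from very common tokens still need separate handling, for example by capping group size.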
  57. Summary
  65. The Net-Net
      Similarity is a common requirement for many applications
      Recommendations
      Entity matching
      Combining Hadoop with search is a powerful combination
      Scalability
      Performance
      Flexibility
  66. Questions?
      Feel free to contact me
      http://www.scaleunlimited.com/contact/
      Take a look at Pat Ferrel's Hadoop + Solr recommender
      http://github.com/pferrel/solr-recommender
      Check out Mahout
      http://mahout.apache.org
      Read paper & code for fuzzyjoin project
      http://asterix.ics.uci.edu/fuzzyjoin/
