SlideShare a Scribd company logo
1 of 26
Download to read offline
Benchmark MinHash +
LSH Algorithm on Spark
Insight Data Engineering Fellow Program, Silicon Valley
Xiaoqian Liu
June 2016
What’s next?
Post Recommendation
● Data: Reddit posts and titles in 12/2014
● Similarity metric: Jaccard Similarity
○ (%common) on titles
Pairwise Similarity Calculation is
Expensive!!
● ~700k posts in 12/2014
● Individual lookup: 700K times, O(n)
● Pairwise calculation: 490B times, O(n^2)
MinHash: Dimension Reduction
Post 1 Dave Grohl tells a story
Post 2 Dave Grohl shares a story with Taylor Swift
Post 3 I knew it was trouble when they drove by
Min hash 1 Min hash 2 Min hash 3 Min hash 4
Post 1 932378 11070 107000 195512
Post 2 20930 213012 107000 195512
Post 3 27698 14136 104464 154376
4 hash funcs
LSH (Locality Sensitive Hashing)
● Further reduce the dimension
● Suppose the table is divided into 2 bands w/ width of 2
● Rehash on each item
● Use (Band id, Band hash) to find similar items
Band 1 Band 2
Post 1 Hash (932378,11070) Hash (107000,195512)
Post 2 Hash (20930, 213012) Hash (107000,195512)
Post 3 Hash (27698,14136) Hash (104464,154376)
Dave Grohl
Dave Grohl
Trouble
*Algorithm source: Mining of Massive Datasets (Rajaraman,Leskovec)
Infrastructure for Evaluation
● Batch implementation+Eval
● Real-time implementation+Eval
Preprocessing
(tokenize, remove
stopwords)
Minhash+LSH
(batch version)
Minhash+LSH
(online version)
Reddits (1/2015)
Reddits (12/2014)
Export LSH+post info
Group lookup+update
6 nodes
(m4.xlarge)
3 nodes
(m4.xlarge)
Evaluation &
Lookup
6 nodes
(m4.xlarge)
1 node
(m4.xlarge)
Batch Processing Optimization on Spark
● SparkSQL join, cartesian product
● Reduce Shuffle times for joining two different datasets:
○ Co-partition before joining
● Persist the data before actions
○ Storage level depends on the RDD size
● Filter results before joining and calculating similarities
○ filter(), reducebyKey()
Batch Processing: Brute-force vs
Minhash+LSH (10 hash funcs, 2 bands)
100k entries, 12/2014 Reddits
Precision and Recall
● 100k entries, estimated threshold = 0.44
Parameters Items
>=threshold
Total
count
Time (sec) Precision Recall num
partitions
Brute-force 16,046 9.99B 29,880 1 1 3,600
k=10, b=2 585 65,353 7.68 0.009 0.036 60
780k reddit posts, precision vs k values
K = # hash functions
780k reddit posts, time vs k values
Streaming: Average Time
● Throughput: 315 events/sec, 10 sec time window
● 8 sec/microbatch, 6 nodes,
Conclusion
● Effectively speed up on batch processing
● Use 400-500 hash functions, set the threshold above .65
○ Filter out pairs w/ low similarities
○ Linear scan for pairs w/ 0 neighbors
● Only for Jaccard Similarity.
○ For cosine similarity: LSH + random projection
About Me
● BS, MS in Systems Engineering
(CS minor), UVA
● Operations/Data Science Intern,
Samsung Austin R&D Center
● ML, NLP at scale
● Music, Singing
“We can have a party, just listening to music”
Backup Slides
Limits & Future Work
● Investigate recall values vs parameters/time/...
○ More recall and precision comparison btw Brute-Force and LSH+MinHash
○ More comparison between different parameter comparisons
● Benchmark for batch processing:
○ Size vs Time
● More detailed benchmark on real-time processing
● More runs of experiments:
○ More representative data
● Optimize resource utilization
MapReduce version of MinHash+LSH
● Mapper side: for each post
○ Calculate min hash values
○ Create bands and band hashes
● Reducer side:
○ Get similar items grouped by (band id, band hash)
○ Calculate jaccard similarity on each item combination ->
find the most similar pair
Threshold of MinHash + LSH
● Estimated Similarity Lower bound for each band:
○ ~(1/#bands)^(1/#rows)
● e.g. k =4, 2 bands and 2 rows. at least 0.70 similar
● Collision
● Higher k, more accurate, but slower
Streaming: Kafka
● Throughput: 146 events/sec, 10 ms time window
Streaming: Average Time
● Throughput:120 events/sec, 10 ms time window
● 8 sec/microbatch, 6 nodes, 1024 MB memory/node
Streaming: Kafka
● Throughput: 315.30 events/sec, 10 ms time window
780k reddit posts, precision vs time
Threshold: 0.4-0.5 Threshold: 0.6-0.7 Threshold: 0.8-0.9
CPU usage
●
Task Diagram
●

More Related Content

What's hot

Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 

What's hot (20)

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Network embedding
Network embeddingNetwork embedding
Network embedding
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Cassandra serving netflix @ scale
Cassandra serving netflix @ scaleCassandra serving netflix @ scale
Cassandra serving netflix @ scale
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
PostgreSQL and CockroachDB SQL
PostgreSQL and CockroachDB SQLPostgreSQL and CockroachDB SQL
PostgreSQL and CockroachDB SQL
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
 
CockroachDB
CockroachDBCockroachDB
CockroachDB
 

Similar to Benchmark MinHash+LSH algorithm on Spark

Similar to Benchmark MinHash+LSH algorithm on Spark (20)

Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
February 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDBFebruary 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDB
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
 
Deep Dive - DynamoDB
Deep Dive - DynamoDBDeep Dive - DynamoDB
Deep Dive - DynamoDB
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Neo4j after 1 year in production
Neo4j after 1 year in productionNeo4j after 1 year in production
Neo4j after 1 year in production
 
Artmosphere Demo
Artmosphere DemoArtmosphere Demo
Artmosphere Demo
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Amazon DynamoDB 深入探討
Amazon DynamoDB 深入探討Amazon DynamoDB 深入探討
Amazon DynamoDB 深入探討
 
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB DayAmazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
 
Druid
DruidDruid
Druid
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for Analysis
 

Recently uploaded

Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 

Recently uploaded (20)

Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Benchmark MinHash+LSH algorithm on Spark

  • 1. Benchmark MinHash + LSH Algorithm on Spark Insight Data Engineering Fellow Program, Silicon Valley Xiaoqian Liu June 2016
  • 3. Post Recommendation ● Data: Reddit posts and titles in 12/2014 ● Similarity metric: Jaccard Similarity ○ (%common) on titles
  • 4. Pairwise Similarity Calculation is Expensive!! ● ~700k posts in 12/2014 ● Individual lookup: 700K times, O(n) ● Pairwise calculation: 490B times, O(n^2)
  • 5. MinHash: Dimension Reduction Post 1 Dave Grohl tells a story Post 2 Dave Grohl shares a story with Taylor Swift Post 3 I knew it was trouble when they drove by Min hash 1 Min hash 2 Min hash 3 Min hash 4 Post 1 932378 11070 107000 195512 Post 2 20930 213012 107000 195512 Post 3 27698 14136 104464 154376 4 hash funcs
  • 6. LSH (Locality Sensitive Hashing) ● Further reduce the dimension ● Suppose the table is divided into 2 bands w/ width of 2 ● Rehash on each item ● Use (Band id, Band hash) to find similar items Band 1 Band 2 Post 1 Hash (932378,11070) Hash (107000,195512) Post 2 Hash (20930, 213012) Hash (107000,195512) Post 3 Hash (27698,14136) Hash (104464,154376) Dave Grohl Dave Grohl Trouble *Algorithm source: Mining of Massive Datasets (Rajaraman,Leskovec)
  • 7. Infrastructure for Evaluation ● Batch implementation+Eval ● Real-time implementation+Eval Preprocessing (tokenize, remove stopwords) Minhash+LSH (batch version) Minhash+LSH (online version) Reddits (1/2015) Reddits (12/2014) Export LSH+post info Group lookup+update 6 nodes (m4.xlarge) 3 nodes (m4.xlarge) Evaluation & Lookup 6 nodes (m4.xlarge) 1 node (m4.xlarge)
  • 8. Batch Processing Optimization on Spark ● SparkSQL join, cartesian product ● Reduce Shuffle times for joining two different datasets: ○ Co-partition before joining ● Persist the data before actions ○ Storage level depends on the RDD size ● Filter results before joining and calculating similarities ○ filter(), reducebyKey()
  • 9. Batch Processing: Brute-force vs Minhash+LSH (10 hash funcs, 2 bands) 100k entries, 12/2014 Reddits
  • 10. Precision and Recall ● 100k entries, estimated threshold = 0.44 Parameters Items >=threshold Total count Time (sec) Precision Recall num partitions Brute-force 16,046 9.99B 29,880 1 1 3,600 k=10, b=2 585 65,353 7.68 0.009 0.036 60
  • 11. 780k reddit posts, precision vs k values K = # hash functions
  • 12. 780k reddit posts, time vs k values
  • 13. Streaming: Average Time ● Throughput: 315 events/sec, 10 sec time window ● 8 sec/microbatch, 6 nodes,
  • 14. Conclusion ● Effectively speed up on batch processing ● Use 400-500 hash functions, set the threshold above .65 ○ Filter out pairs w/ low similarities ○ Linear scan for pairs w/ 0 neighbors ● Only for Jaccard Similarity. ○ For cosine similarity: LSH + random projection
  • 15. About Me ● BS, MS in Systems Engineering (CS minor), UVA ● Operations/Data Science Intern, Samsung Austin R&D Center ● ML, NLP at scale ● Music, Singing “We can have a party, just listening to music”
  • 17. Limits & Future Work ● Investigate recall values vs parameters/time/... ○ More recall and precision comparison btw Brute-Force and LSH+MinHash ○ More comparison between different parameter comparisons ● Benchmark for batch processing: ○ Size vs Time ● More detailed benchmark on real-time processing ● More runs of experiments: ○ More representative data ● Optimize resource utilization
  • 18. MapReduce version of MinHash+LSH ● Mapper side: for each post ○ Calculate min hash values ○ Create bands and band hashes
  • 19. ● Reducer side: ○ Get similar items grouped by (band id, band hash) ○ Calculate jaccard similarity on each item combination -> find the most similar pair
  • 20. Threshold of MinHash + LSH ● Estimated Similarity Lower bound for each band: ○ ~(1/#bands)^(1/#rows) ● e.g. k =4, 2 bands and 2 rows. at least 0.70 similar ● Collision ● Higher k, more accurate, but slower
  • 21. Streaming: Kafka ● Throughput: 146 events/sec, 10 ms time window
  • 22. Streaming: Average Time ● Throughput:120 events/sec, 10 ms time window ● 8 sec/microbatch, 6 nodes, 1024 MB memory/node
  • 23. Streaming: Kafka ● Throughput: 315.30 events/sec, 10 ms time window
  • 24. 780k reddit posts, precision vs time Threshold: 0.4-0.5 Threshold: 0.6-0.7 Threshold: 0.8-0.9