Practical Data Science Workshop - Recommendation Systems - Collaborative Filtering - Strata NY - 2015
1. Practical Data Science on Spark & Hadoop
Collaborative Filtering Recommendation Systems
Chris Fregly
Principal Data Solutions Engineer, IBM Spark Technology Center
2. Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
3. Who am I?
4. Outline
5. Live, Interactive, Group Demo!
①  Navigate to sparkafterdark.com
②  Select 3 actresses and 3 actors
③  Wait for me to build the models
https://github.com/fluxcapacitor/pipeline
6. Outline
7. Bloom Filter
Approximate set membership
k hashes on put/get
False positives possible; no false negatives
Used throughout Spark
From Twitter's Algebird
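A minimal Algebird sketch of the put/get flow (assuming Algebird 0.13+; the sizing numbers are illustrative):

```scala
import com.twitter.algebird._

object BloomFilterSketch extends App {
  // Monoid for Bloom filters sized for ~1000 entries at a 1% false-positive rate
  val bfMonoid = BloomFilter[String](1000, 0.01)

  // put: add items; internally each item sets k hashed bit positions
  val bf = bfMonoid.create("alice", "bob", "carol")

  // get: membership is approximate -- false positives possible, no false negatives
  println(bf.contains("alice").isTrue) // true
  println(bf.contains("zoe").isTrue)   // false with high probability
}
```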
8. Count Min Sketch
Approximate frequency counters
More memory-efficient than a HashMap of exact counts
Low, fixed memory
Known error bounds
Scales to a large number of counters
From Twitter's Algebird
Streaming example in the Spark codebase
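A minimal Algebird Count-Min sketch example; the eps/delta values are illustrative:

```scala
import com.twitter.algebird._

object CountMinSketchExample extends App {
  // eps bounds the relative over-count; delta is the probability of exceeding it
  val cmsMonoid = CMS.monoid[Long](0.001, 1e-7, 1) // eps, delta, seed
  val itemIds: Seq[Long] = Seq(1L, 2L, 2L, 3L, 3L, 3L)
  val cms = cmsMonoid.create(itemIds)

  // Estimates never undercount; they may overcount within the error bound
  println(cms.frequency(3L).estimate) // ~3
  println(cms.totalCount)             // 6
}
```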
9. HyperLogLog
Approximate cardinality (count distinct)
Low memory: ~1.5 KB at 2% error for 10^9 elements!
From Twitter's Algebird
Streaming example in the Spark codebase
RDD.countApproxDistinctByKey()
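A minimal sketch of the RDD-level API, which uses HyperLogLog internally; the data and relativeSD value are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ApproxDistinctExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("hll").getOrCreate()

  // 10M events that collapse to 100K distinct keys
  val events = spark.sparkContext.parallelize(1L to 10000000L).map(_ % 100000L)

  // HyperLogLog under the hood; relativeSD trades memory for accuracy
  println(events.countApproxDistinct(relativeSD = 0.02)) // ~100000
  spark.stop()
}
```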
10. Monte Carlo Simulations
From the Manhattan Project (A-bomb): simulating the movement of neutrons
Law of Large Numbers (LLN): the average of many trials converges on the expected value
SparkPi example in the Spark codebase:
Pi ≈ 4 * (# red dots / # total dots)
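A short sketch in the spirit of the SparkPi example from the Spark codebase; the dart count and master URL are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object MonteCarloPi extends App {
  val spark = SparkSession.builder.master("local[*]").appName("pi").getOrCreate()
  val n = 10000000

  // Throw n random darts at the unit square; count those landing in the unit circle
  val inside = spark.sparkContext.parallelize(1 to n).map { _ =>
    val rng = scala.util.Random
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    if (x * x + y * y <= 1) 1 else 0
  }.reduce(_ + _)

  // LLN: inside / n converges to the circle-to-square area ratio, Pi / 4
  println(s"Pi is roughly ${4.0 * inside / n}")
  spark.stop()
}
```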
11. Demo! Monte Carlo Simulation
12. Outline
13. Euclidean Similarity
Linear measure
Bias toward magnitude
14. Cosine Similarity
Angle measure
Corrects magnitude bias
15. Jaccard Similarity
Set intersection divided by set union
Bias toward popularity
16. Log Likelihood Similarity
Corrects popularity bias
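A plain-Scala sketch of two of these measures, cosine and Jaccard, to make the bias trade-offs concrete (log likelihood omitted for brevity):

```scala
object SimilarityMeasures extends App {
  // Cosine: angle between vectors; ignores magnitude
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(y => y * y).sum)
    dot / norms
  }

  // Jaccard: |intersection| / |union|; biased toward popular (large) sets
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    a.intersect(b).size.toDouble / a.union(b).size

  // Same direction, different magnitude -> cosine is 1.0 (magnitude bias corrected)
  println(cosine(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0))) // 1.0
  println(jaccard(Set("a", "b", "c"), Set("b", "c", "d")))    // 0.5
}
```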
17. Calculating Similarity
Also known as “all-pairs similarity”, “pair-wise similarity”, or “similarity join”
Naïve implementation: O(m*n^2); m = rows, n = cols
Must minimize shuffle and computation
18. Minimizing Shuffle and Computation
Approximate!
Reduce m (rows):
  Sampling
  Bucketing (aka “partitioning” or “clustering”)
  Removing rows with sparsity below a threshold (i.e., inactive users)
Reduce n (cols):
  Remove the most frequent value (i.e., 0)
  Remove the least popular
19. Reduce m (rows): Sampling
DIMSUM: “Dimension Independent Matrix Square Using MapReduce”
Removes rows with a low probability of contributing to similarity
RowMatrix.columnSimilarities() in Spark MLlib
Twitter reported a ~40% efficiency gain over naïve cosine similarity
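A minimal columnSimilarities() sketch; passing a threshold above 0 turns on the DIMSUM sampling (the matrix here is a toy):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object DimsumExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("dimsum").getOrCreate()

  // Rows = instances, columns = items whose pairwise similarity we want
  val rows = spark.sparkContext.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 3.0),
    Vectors.dense(4.0, 5.0, 0.0),
    Vectors.dense(0.0, 6.0, 7.0)))
  val mat = new RowMatrix(rows)

  // threshold > 0 enables DIMSUM sampling: pairs unlikely to clear it are skipped
  val sims = mat.columnSimilarities(threshold = 0.1)
  sims.entries.collect().foreach(println) // MatrixEntry(i, j, cosine similarity)
  spark.stop()
}
```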
20. Reduce m (rows): Bucketing
LSH: “Locality Sensitive Hashing”
Split m into b buckets with a similarity-preserving hash function
Requires pre-processing
Compare items only within the same bucket; comparisons are parallelizable
O(m*n^2) -> O(m*n/b*b^2)
O(1.25E17) -> O(1.25E13); b=50
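Spark had no built-in LSH at the time, so here is a hypothetical plain-Scala sketch of the bucket-then-compare idea using random-hyperplane (sign) hashing; production systems use multiple hash tables to recover near-misses:

```scala
object LshBucketingSketch extends App {
  val rng = new scala.util.Random(42)
  val dim = 4
  val numPlanes = 3 // 2^3 = 8 possible buckets

  // Random hyperplanes; each contributes one bit of the similarity hash
  val planes = Array.fill(numPlanes, dim)(rng.nextGaussian())

  def bucket(v: Array[Double]): Int =
    planes.zipWithIndex.map { case (p, i) =>
      val dot = p.zip(v).map { case (a, b) => a * b }.sum
      if (dot >= 0) 1 << i else 0
    }.sum

  val items = Map(
    "a" -> Array(1.0, 0.0, 0.0, 0.0),
    "b" -> Array(0.9, 0.1, 0.0, 0.0), // near "a" -> likely the same bucket
    "c" -> Array(0.0, 0.0, 1.0, 1.0))

  // Only items sharing a bucket get compared pairwise
  items.groupBy { case (_, v) => bucket(v) }.foreach { case (b, group) =>
    println(s"bucket $b: ${group.keys.mkString(", ")}")
  }
}
```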
21. Reduce n (cols)
Remove the most frequent value
Replace dense rows with (index, value) pairs
O(m*n^2) -> O(m*nnz^2); nnz = number of non-zeros
Be sure to remove the most frequent value; it may not be 0!
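A minimal sketch of the (index, value) representation using MLlib's sparse vectors:

```scala
import org.apache.spark.mllib.linalg.Vectors

object SparseVectorExample extends App {
  // Dense: every column stored. Sparse: only (index, value) pairs for non-zeros
  val dense  = Vectors.dense(0.0, 0.0, 3.0, 0.0, 5.0)
  val sparse = Vectors.sparse(5, Seq((2, 3.0), (4, 5.0)))

  println(dense.toSparse) // (5,[2,4],[3.0,5.0]) -- 2 stored values instead of 5
  println(sparse)         // same representation, built directly
}
```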
22. Outline
23. Recommendation/ML Terminology
User: user seeking recommendations
Item: item being recommended
Explicit User Feedback: like or rating
Implicit User Feedback: search, click, hover, view, scroll
Instances: rows of user feedback/input data
Overfitting: training a model too closely to the training data and hyperparameters
Hold Out Split: holding out some instances to avoid overfitting
Features: columns of the instance rows (of feedback/input data)
Cold Start Problem: not enough data to personalize (new user or item)
Hyperparameter: model-specific config knobs for tuning (tree depth, iterations, etc.)
Model Evaluation: comparing predictions to the actual values of the hold out split
24. Features
Dimensions: alias for features
Binary Features: true or false
Numeric Discrete Features: integers
Numeric Features: real values
Ordinal Features: maintain order (S -> M -> L -> XL -> XXL)
Temporal Features: time-based (time of day, binge watching)
Categorical Features: finite, unique set of categories (NFL teams)
Feature Engineering: modify, reduce, or combine features
25. Feature Engineering
Dimension Reduction: reduce the number of features, or “feature space”
Principal Component Analysis (PCA): find the principal features that describe the data
One-Hot Encoding: convert categorical feature values to 0s and 1s
  Bears    -> 1    becomes 1,0,0
  49’ers   -> 2    becomes 0,1,0
  Steelers -> 3    becomes 0,0,1
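A minimal sketch of the same encoding in Spark ML (assuming Spark 3.x, where OneHotEncoder is fit like an estimator); StringIndexer produces the numeric index first:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

object OneHotExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("onehot").getOrCreate()
  val df = spark.createDataFrame(Seq(
    (0, "Bears"), (1, "49ers"), (2, "Steelers"))).toDF("id", "team")

  // Category -> numeric index
  val indexed = new StringIndexer()
    .setInputCol("team").setOutputCol("teamIndex")
    .fit(df).transform(df)

  // Numeric index -> 0/1 vector; keep all positions to match the slide's example
  val encoded = new OneHotEncoder()
    .setInputCols(Array("teamIndex")).setOutputCols(Array("teamVec"))
    .setDropLast(false)
    .fit(indexed).transform(indexed)

  encoded.select("team", "teamVec").show(false)
  spark.stop()
}
```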
26. Non-Personalized Recommendations
“Cold Start” Problem
Top K Aggregations
Summary Statistics
PageRank
Facebook Graph
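A minimal Top K aggregation sketch over hypothetical interaction events; ranking by raw popularity is a common cold-start fallback:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object TopKExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("topk").getOrCreate()
  import spark.implicits._

  // Hypothetical (userId, itemId) interaction events
  val events = Seq((1, "a"), (2, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "c"))
    .toDF("userId", "itemId")

  // Top-K most popular items -- the same recommendation for every user
  events.groupBy("itemId").count().orderBy(desc("count")).limit(2).show()
  spark.stop()
}
```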
27. Demo! Top K Aggregations, PageRank
28. Personalized Recommendations
Collaborative Filtering:
  User-to-Item
  Item-to-Item
Clustering (Similarity):
  Users
  Items
29. User-to-Item Collaborative Filtering
Find similar users based on similarity function(s): cosine similarity, etc.
Recommend items that those similar users have chosen
Exclude items the user has already chosen
Rank items by the number of similar users who have chosen them
Alternating Least Squares Matrix Factorization
30. Matrix Factorization
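A minimal ALS sketch with Spark ML on made-up ratings; rank is the number of latent factors in the factorization:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("als").getOrCreate()
  import spark.implicits._

  // Hypothetical (userId, itemId, rating) explicit-feedback triples
  val ratings = Seq(
    (0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f),
    (1, 12, 2.0f), (2, 11, 3.0f), (2, 12, 4.0f))
    .toDF("userId", "itemId", "rating")

  // Factor the user-item matrix into rank-8 user and item factor matrices
  val als = new ALS()
    .setRank(8).setMaxIter(10).setRegParam(0.1)
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
  val model = als.fit(ratings)

  model.recommendForAllUsers(2).show(false) // top-2 items per user
  spark.stop()
}
```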
31. Item-to-Item Collaborative Filtering
Made famous by Amazon ~2003
Traditional User-to-Item algorithms couldn’t scale
Offline: generate an ItemID::List[CustomerID] vector for each item
Online: for each item in the shopping cart, find similar items based on the closest List[CustomerID] vectors
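A toy plain-Scala sketch of the offline/online split, substituting Jaccard over customer sets for the “closest List[CustomerID] vector” step (the catalog is hypothetical):

```scala
object ItemToItemSketch extends App {
  // Offline artifact: item -> set of customers who bought it (hypothetical data)
  val bought = Map(
    "guitar" -> Set("ann", "bob", "carl"),
    "amp"    -> Set("ann", "bob", "dave"),
    "drums"  -> Set("erin", "fred"))

  def jaccard(a: Set[String], b: Set[String]): Double =
    a.intersect(b).size.toDouble / a.union(b).size

  // Online: for an item in the cart, rank other items by customer-vector closeness
  def similarTo(item: String): Seq[(String, Double)] =
    (bought - item).map { case (other, cs) => other -> jaccard(bought(item), cs) }
      .toSeq.sortBy(-_._2)

  println(similarTo("guitar")) // amp first, drums last
}
```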
32. User and Item Clustering (Similarity)
Based on similarity, e.g., similar profile/description text or categories
LDA topics, K-Means, Nearest Neighbor, Eigenfaces, PCA
33. Streaming K-Means Clustering
Start with an initial set of k clusters with random centers
For incoming data:
  Assign each point to the closest cluster (distance to center)
  Update centers to minimize the within-cluster sum of squares
Half-life decay factor:
  Reduces the contribution of old data to half, measured in number of batches or number of data points
Eliminate dead clusters that are never assigned new data:
  Split an existing cluster and join it with the dead cluster
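A minimal StreamingKMeans sketch from Spark MLlib; the input directory is hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansSketch extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("skm")
  val ssc = new StreamingContext(conf, Seconds(5))

  // Half-life of 10 batches: a point's contribution halves every 10 batches
  val model = new StreamingKMeans()
    .setK(3)
    .setHalfLife(10, "batches")
    .setRandomCenters(2, 0.0, 42L) // dim, weight, seed

  // Vectors arrive as lines like "[1.0,2.0]" in files dropped into the directory
  val points = ssc.textFileStream("/tmp/kmeans-input").map(Vectors.parse)
  model.trainOn(points) // centers update each batch; old data decays

  ssc.start()
  ssc.awaitTermination()
}
```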
34. Demo! Alternating Least Squares Matrix Factorization
35. Outline
36. Split Instance Data
3 roles:
  Model Training (80%)
  Model Validation (10%)
  Model Testing (10%)
k-folds Cross Validation:
  Divide the instances into k sections
  Alternate each of the k sections between the 3 roles above
http://www.slideshare.net/SebastianRaschka/musicmood-20140912
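A minimal sketch of the 80/10/10 role split via randomSplit; the instances are synthetic:

```scala
import org.apache.spark.sql.SparkSession

object SplitExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("split").getOrCreate()
  import spark.implicits._
  val instances = (1 to 1000).map(i => (i, i % 5)).toDF("id", "label")

  // 80/10/10 split into training / validation / test roles
  val Array(training, validation, test) =
    instances.randomSplit(Array(0.8, 0.1, 0.1), seed = 42)
  println(s"${training.count()} / ${validation.count()} / ${test.count()}")
  spark.stop()
}
```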
37. Hyperparameter Selection
Select sets of values for each hyperparameter
Use GridSearch to find the best combination to reduce error
Avoid overfitting!
http://www.slideshare.net/ogrisel/strategies-and-tools-for-parallel-machine-learning-in-python
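A minimal grid search sketch using Spark ML's ParamGridBuilder and CrossValidator around ALS; the ratings and grid values are made up:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object GridSearchExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("grid").getOrCreate()
  import spark.implicits._

  // Hypothetical ratings: 30 users x 10 items with a synthetic preference pattern
  val ratings = (for {
    user <- 0 until 30
    item <- 0 until 10 if (user + item) % 3 != 0
  } yield (user, item, ((user * item) % 5 + 1).toFloat))
    .toDF("userId", "itemId", "rating")

  val als = new ALS()
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
    .setColdStartStrategy("drop") // avoid NaN predictions for unseen users/items

  // Every combination of these values gets trained and evaluated
  val grid = new ParamGridBuilder()
    .addGrid(als.rank, Array(8, 16))
    .addGrid(als.regParam, Array(0.01, 0.1, 1.0))
    .build()

  val evaluator = new RegressionEvaluator()
    .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")

  // k-fold cross validation over the grid; the best combo minimizes RMSE
  val cv = new CrossValidator()
    .setEstimator(als).setEvaluator(evaluator)
    .setEstimatorParamMaps(grid).setNumFolds(3)
  val cvModel = cv.fit(ratings)
  println(s"best average RMSE: ${cvModel.avgMetrics.min}")
  spark.stop()
}
```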
38. Evaluation Criteria
Regression (distance has meaning):
  Root Mean Square Error (RMSE)
  Mean Absolute Error (MAE)
Categorical (distance does not have meaning):
  Precision/Accuracy
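A plain-Scala sketch of both regression metrics on made-up predictions; note how RMSE penalizes large errors more than MAE:

```scala
object ErrorMetrics extends App {
  val predicted = Seq(3.8, 2.1, 4.9)
  val actual    = Seq(4.0, 2.0, 5.0)
  val errs = predicted.zip(actual).map { case (p, a) => p - a }

  // RMSE squares each error before averaging; MAE averages absolute errors
  val rmse = math.sqrt(errs.map(e => e * e).sum / errs.size)
  val mae  = errs.map(math.abs).sum / errs.size
  println(f"RMSE=$rmse%.4f MAE=$mae%.4f")
}
```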
39. Outline
40. ML Pipelines
Inspired by scikit-learn
Transformers: transform() the input for estimation (training); predict() on new input
Estimators: fit() a model to the transformed dataset (training)
Pipeline: chain everything together
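A minimal Pipeline sketch modeled on the Spark ML docs example (Tokenizer -> HashingTF -> LogisticRegression); the training rows are made up:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("pipeline").getOrCreate()
  val training = spark.createDataFrame(Seq(
    (0L, "spark rocks", 1.0), (1L, "hadoop disk", 0.0),
    (2L, "spark mllib", 1.0), (3L, "grid search", 0.0))).toDF("id", "text", "label")

  // Transformers reshape the data; the Estimator (lr) is fit() on their output
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)

  // Pipeline chains everything; fit() returns a PipelineModel for new input
  val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  model.transform(training).select("text", "prediction").show(false)
  spark.stop()
}
```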
41. Outline
42. $1 Million Netflix Prize
October 2006 -> September 2009 (3 years!)
The winning algorithm beat Netflix's own recommender by 10.06% on RMSE
Ensemble of 500+ models, combined using Gradient Boosted Decision Trees
Computationally intensive and impractical
43. Winning Algorithm Adjustments
“Alice effect”: Alice tends to rate lower than the average user
“Inception effect”: Inception tends to be rated higher than the average movie
“Alice-Inception effect”: the combination of the Alice and Inception effects
Number of days since a user’s first rating
Number of days since a movie’s first rating
Number of people who have rated a movie
A movie’s overall mean rating
Factor these out and find the baseline!
44. Thanks!
Chris Fregly
@cfregly

References
①  https://github.com/fluxcapacitor/pipeline
②  http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
③  http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
④  http://spark.apache.org/docs/latest/ml-guide.html
