Practical Data Science
on Spark & Hadoop
Collaborative Filtering
Recommendation Systems
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Who am I?
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Live, Interactive, Group Demo!
①  Navigate to sparkafterdark.com
②  Select 3 actresses and 3 actors
③  Wait for me to build the models
https://github.com/fluxcapacitor/pipeline
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Bloom Filter
Approximate set membership
k hashes on put/get
False positives possible, never false negatives
Used throughout Spark
From Twitter’s Algebird
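A minimal, self-contained Scala sketch of the idea (illustrative only, not Algebird's BloomFilter API): an m-bit array and k hash positions per item.

  import scala.util.hashing.MurmurHash3

  // Minimal Bloom filter sketch: m-bit array, k hash positions per item
  class SimpleBloomFilter(m: Int, k: Int) {
    private val bits = new java.util.BitSet(m)

    // Derive k positions from two seeded MurmurHash3 values (double hashing)
    private def positions(item: String): Seq[Int] = {
      val h1 = MurmurHash3.stringHash(item, 0)
      val h2 = MurmurHash3.stringHash(item, h1)
      (0 until k).map(i => math.abs((h1 + i * h2) % m))
    }

    def put(item: String): Unit = positions(item).foreach(i => bits.set(i))

    // May return false positives, never false negatives
    def mightContain(item: String): Boolean = positions(item).forall(i => bits.get(i))
  }

  val bf = new SimpleBloomFilter(m = 10000, k = 5)
  Seq("alice", "bob").foreach(bf.put)
  bf.mightContain("alice")   // true
  bf.mightContain("carol")   // almost certainly false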
Count Min Sketch
Approximate counters
More memory-efficient than a HashMap of exact counters
Low, fixed memory
Known error bounds
Large number of counters
From Twitter’s Algebird
Streaming example in Spark codebase
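A minimal Scala sketch of the counting idea (illustrative only, not Algebird's CMS API): d rows of w counters, all incremented on add, read back as the minimum across rows.

  import scala.util.hashing.MurmurHash3

  class SimpleCMS(w: Int, d: Int) {
    private val table = Array.ofDim[Long](d, w)

    private def col(item: String, row: Int): Int =
      math.abs(MurmurHash3.stringHash(item, row) % w)

    // Increment one counter per row (low, fixed memory: d * w counters total)
    def add(item: String, count: Long = 1L): Unit =
      for (row <- 0 until d) table(row)(col(item, row)) += count

    // Never under-estimates; over-estimation is bounded by the sketch's error guarantee
    def estimate(item: String): Long =
      (0 until d).map(row => table(row)(col(item, row))).min
  }

  val cms = new SimpleCMS(w = 1000, d = 5)
  Seq("spark", "hadoop", "spark").foreach(s => cms.add(s))
  cms.estimate("spark")   // >= 2, almost always exactly 2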
HyperLogLog
Approximate cardinality
Approximate count distinct
Low memory
1.5 KB @ 2% error on 10^9 elements!
From Twitter’s Algebird
Streaming example in Spark codebase
countApproxDistinctByKey()
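A short usage sketch of Spark's HLL-backed approximate distinct counts (assumes an existing SparkContext sc; the (userId, itemId) view data is made up):

  val views = sc.parallelize(Seq(("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u1", "i1")))

  // Approximate number of distinct items across all users (~2% target error)
  val distinctItems = views.values.countApproxDistinct(relativeSD = 0.02)

  // Approximate number of distinct items per user, same HLL idea keyed by userId
  val distinctPerUser = views.countApproxDistinctByKey(relativeSD = 0.02).collect()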
Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
The average of many trials converges on the expected value
SparkPi example in Spark codebase
Pi ≈ (# red dots / # total dots) * 4 (see the sketch below)
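A SparkPi-style sketch of that estimate (assumes an existing SparkContext sc): sample points in the unit square and count the fraction that lands inside the quarter circle.

  val n = 10000000
  val inside = sc.parallelize(1 to n).map { _ =>
    val (x, y) = (math.random, math.random)
    if (x * x + y * y <= 1.0) 1 else 0   // a "red dot": inside the quarter circle
  }.reduce(_ + _)
  println(s"Pi is roughly ${4.0 * inside / n}")   // LLN: converges as n grows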
Demo!
Monte Carlo Simulation
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Euclidean Similarity
Linear measure
Bias toward magnitude
Cosine Similarity
Angle measure
Corrects magnitude bias
Jaccard Similarity
Set Intersection divided by Set Union
Bias towards popularity
Log Likelihood Similarity
Corrects popularity bias
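Minimal Scala sketches of three of the measures above, over plain collections (log likelihood omitted for brevity):

  // Euclidean: straight-line distance; larger vectors dominate (magnitude bias)
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine: angle between vectors; dividing by the norms removes the magnitude bias
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(y => y * y).sum))
  }

  // Jaccard: set intersection divided by set union; popular items overlap more (popularity bias)
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    (a intersect b).size.toDouble / (a union b).size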
Calculating Similarity
“All-pairs similarity”
“Pair-wise similarity”
“Similarity join”
Naïve impl: O(m*n^2); m=rows, n=cols
Must minimize shuffle and computation
Minimizing Shuffle and Computation
Approximate!
Reduce m (rows)
Sampling
Bucketing (a.k.a. “Partitioning” or “Clustering”)
Removing rows with too few non-zero entries (i.e., inactive)
Reduce n (cols)
Remove the most frequent value (i.e., 0)
Remove least popular
Reduce m (rows): Sampling
DIMSUM
“Dimension Independent Matrix Square using MapReduce”
Remove rows with low probability of similarity
RowMatrix.columnSimilarities()
Twitter reported a 40% efficiency gain over naïve cosine similarity
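A usage sketch of DIMSUM via MLlib's RowMatrix (assumes an existing SparkContext sc; the toy rows are made up):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 3.0),
    Vectors.dense(4.0, 5.0, 0.0)))
  val mat = new RowMatrix(rows)

  val exact  = mat.columnSimilarities()                   // brute-force all-pairs cosine
  val approx = mat.columnSimilarities(threshold = 0.1)    // DIMSUM sampling: keeps pairs likely
                                                          // above the threshold, far less shuffle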
Reduce m (rows): Bucketing
LSH
“Locality Sensitive Hashing”
Split the m rows into b buckets with a similarity-preserving hash function
Requires pre-processing
Compare items within buckets
Comparison is parallelizable
O(m*n^2) -> O(m*n/b*b^2)
O(1.25E17) -> O(1.25E13); b=50
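A hand-rolled bucketing sketch (illustrative only, not a library LSH implementation; the sign-pattern signature is a hypothetical stand-in for a real LSH family):

  val rows: Seq[(Long, Array[Double])] = Seq(
    (0L, Array(1.0, 0.0, 2.0)),
    (1L, Array(0.9, 0.0, 2.1)),
    (2L, Array(0.0, 5.0, 0.0)))

  // Similar rows tend to share a sign pattern, so they tend to land in the same bucket
  def signature(v: Array[Double], b: Int): Int =
    math.abs(v.map(x => if (x > 0.0) '1' else '0').mkString.hashCode % b)

  val candidatePairs = rows
    .groupBy { case (_, v) => signature(v, b = 50) }   // pre-processing: hash into b buckets
    .values
    .flatMap(bucket => bucket.combinations(2))         // compare only within each bucket
    .toList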
Reduce n (cols)
Remove most frequent values
Replace with (index,value) pairs
O(m*n^2) -> O(m*nnz^2); nnz = number of non-zeros
Be sure to check which value is most frequent (it may not be 0!)
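A small sketch with MLlib vectors: the dense row below is mostly zeros, so storing only (index, value) pairs keeps just the nnz entries.

  import org.apache.spark.mllib.linalg.Vectors

  val dense  = Vectors.dense(0.0, 0.0, 3.0, 0.0, 7.0)
  val sparse = Vectors.sparse(5, Seq((2, 3.0), (4, 7.0)))   // size, then (index, value) pairs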
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Recommendation/ML Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like or rating
Implicit User Feedback: search, click, hover, view, scroll
Instances: Rows of user feedback/input data
Overfitting: Fitting the training data (and its noise) so closely that the model fails to generalize
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new user or new item)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations, etc)
Model Evaluation: Compare predictions to actual values of hold out split
Features
Dimensions: Alias for Features
Binary Features: True or False
Numeric Discrete Features: Integers
Numeric Features: Real values
Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL)
Temporal Features: Time-based (Time of Day, Binge Watching)
Categorical Features: Finite, unique set of categories (NFL teams)
Feature Engineering: Modify, reduce, combine features
Feature Engineering
Dimension Reduction: Reduce num features or “feature space”
Principal Component Analysis (PCA): Find the principal components that describe the data
One-Hot Encoding: Convert categorical feature vals to 0’s, 1’s
Bears    -> 1          Bears    -> 1,0,0
49ers    -> 2   -->    49ers    -> 0,1,0
Steelers -> 3          Steelers -> 0,0,1
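A sketch of the same encoding with Spark ML (assumes an existing SQLContext sqlContext; column names are illustrative):

  import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

  val df = sqlContext.createDataFrame(Seq(
    (0, "Bears"), (1, "49ers"), (2, "Steelers"))).toDF("id", "team")

  // Categorical string -> numeric index, most frequent category first
  val indexed = new StringIndexer()
    .setInputCol("team").setOutputCol("teamIndex").fit(df).transform(df)

  // Numeric index -> sparse 0/1 vector
  val encoded = new OneHotEncoder()
    .setInputCol("teamIndex").setOutputCol("teamVec").transform(indexed)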
Non-Personalized Recommendations
“Cold Start” Problem
Top K Aggregations
Summary Statistics
PageRank
Facebook Graph
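A sketch of a non-personalized Top-K fallback for the cold-start case (assumes an existing SparkContext sc; the (userId, itemId) feedback pairs are made up):

  val choices = sc.parallelize(Seq(("u1", "i1"), ("u2", "i1"), ("u3", "i2")))

  val top10 = choices
    .map { case (_, itemId) => (itemId, 1) }
    .reduceByKey(_ + _)                  // popularity count per item
    .top(10)(Ordering.by(_._2))          // 10 most-chosen items overall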
Demo!
Top K Aggregations
PageRank
Personalized Recommendations
Collaborative Filtering
User-to-Item
Item-to-Item
Clustering (Similarity)
Users
Items
User-to-Item Collaborative Filtering
Find similar users based on similarity function(s)
Cosine similarity, etc
Recommend items that other similar users have chosen
Exclude items that have already been chosen
Rank items by the number of similar users who have chosen them
Alternating Least Squares
Matrix Factorization
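A minimal sketch with MLlib's ALS matrix factorization (assumes an existing SparkContext sc; the ratings and hyperparameter values are made up):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.parallelize(Seq(
    Rating(user = 1, product = 10, rating = 5.0),
    Rating(user = 1, product = 20, rating = 1.0),
    Rating(user = 2, product = 10, rating = 4.0)))

  // rank = number of latent factors; iterations and lambda are hyperparameters to tune
  val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

  model.predict(user = 2, product = 20)        // predicted rating for one (user, item) pair
  model.recommendProducts(user = 2, num = 5)   // top-5 personalized recommendations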
Item-to-Item Collaborative Filtering
Made famous by Amazon ~2003
Couldn’t scale traditional User-to-Item algos
Offline: Generates ItemID::List[CustomerID] vectors
Online: For each item in the shopping cart, find similar items based on the closest List[CustomerID] vectors
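A toy sketch of that flow (illustrative only, not Amazon's actual implementation; Jaccard over customer sets stands in for whatever vector similarity is used; assumes an existing SparkContext sc):

  // Offline: ItemID -> Set[CustomerID], built from (customerId, itemId) purchase pairs
  val purchases = sc.parallelize(Seq(("c1", "i1"), ("c2", "i1"), ("c2", "i2"), ("c3", "i2")))
  val itemVectors = purchases
    .map { case (customerId, itemId) => (itemId, Set(customerId)) }
    .reduceByKey(_ ++ _)
    .collectAsMap()

  // Online: for an item in the cart, rank the other items by customer-set overlap
  def similarTo(itemId: String) = itemVectors
    .filterKeys(_ != itemId)
    .mapValues(cs => (cs intersect itemVectors(itemId)).size.toDouble /
                     (cs union itemVectors(itemId)).size)
    .toSeq.sortBy { case (_, sim) => -sim }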
User and Item Clustering (Similarity)
Based on Similarity
i.e. Similar Profile/Description Text or Categories
LDA Topic, K-Means, Nearest Neighbor, Eigenfaces, PCA
Streaming K Means Clustering
Initial set of k clusters with random centers
Incoming data:
Assign to closest cluster: distance to center
Update centers: minimize within-cluster-sum-of-squares
Half-life decay factor
Reduces the contribution of old data by half every half-life
Measured in num batches or num data points
Eliminate dead clusters that are never assigned new data
Split an existing cluster and merge it with the dead cluster
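A usage sketch of MLlib's StreamingKMeans (assumes an existing StreamingContext and a DStream[Vector] called trainingStream with 10-dimensional points):

  import org.apache.spark.mllib.clustering.StreamingKMeans

  val model = new StreamingKMeans()
    .setK(5)                                   // number of clusters
    .setHalfLife(5, "batches")                 // old data contributes half as much every 5 batches
    .setRandomCenters(dim = 10, weight = 0.0)  // k random initial centers

  model.trainOn(trainingStream)                // assign to closest center, update centers per batch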
Demo!
Alternating Least Squares
Matrix Factorization
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Split Instance Data
3 Roles
Model Training (80%)
Model Validation (10%)
Model Testing (10%)
k-folds Cross Validation
Divide instances into k sections
Alternate each of the k sections between the 3 roles above
http://www.slideshare.net/SebastianRaschka/musicmood-20140912
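A sketch of the split (assumes instances is an RDD of labeled instances; the seed is arbitrary):

  // 80% training / 10% validation / 10% test
  val Array(training, validation, test) =
    instances.randomSplit(Array(0.8, 0.1, 0.1), seed = 42L)

  // k-folds: an array of (training, validation) pairs over the same instances
  import org.apache.spark.mllib.util.MLUtils
  val folds = MLUtils.kFold(instances, 10, 42)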
Hyperparameter Selection
Select sets of values for each hyperparameter
Use grid search to find the combination that minimizes error
Avoid overfitting!
http://www.slideshare.net/ogrisel/strategies-and-tools-for-parallel-machine-learning-in-python
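A sketch with Spark ML's ParamGridBuilder and CrossValidator, using ALS as the estimator (assumes a training DataFrame with userId, itemId, and rating columns; the grid values are made up):

  import org.apache.spark.ml.evaluation.RegressionEvaluator
  import org.apache.spark.ml.recommendation.ALS
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val als = new ALS().setUserCol("userId").setItemCol("itemId").setRatingCol("rating")

  val evaluator = new RegressionEvaluator()
    .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")

  val paramGrid = new ParamGridBuilder()
    .addGrid(als.rank, Array(10, 20, 50))      // candidate values per hyperparameter
    .addGrid(als.regParam, Array(0.01, 0.1))
    .build()

  val cv = new CrossValidator()
    .setEstimator(als)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)                            // cross-validate each combination

  val bestModel = cv.fit(training)             // the combination with the best RMSE wins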
Evaluation Criteria
Regression (Distance has meaning)
Root Mean Square Error (RMSE)
Mean Absolute Error (MAE)
Categorical (Distance does not have meaning)
Precision/Accuracy
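A sketch of computing both regression metrics on the hold-out split (assumes an RDD of (prediction, actual) Double pairs called predictionsAndLabels):

  import org.apache.spark.mllib.evaluation.RegressionMetrics

  val metrics = new RegressionMetrics(predictionsAndLabels)
  metrics.rootMeanSquaredError   // RMSE: penalizes large errors more heavily
  metrics.meanAbsoluteError      // MAE: average absolute distance from the actual value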
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
ML Pipelines
Inspired by scikit-learn
Transformers
transform() input for estimation (training)
predict() new input
Estimators
fit() a model to the transformed dataset (training)
Pipeline
Chain everything together
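A minimal Pipeline sketch (the stages are placeholders to show the chaining, not the talk's exact pipeline; assumes a training DataFrame with team and label columns and a newData DataFrame to score):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

  val indexer = new StringIndexer().setInputCol("team").setOutputCol("teamIndex")
  val encoder = new OneHotEncoder().setInputCol("teamIndex").setOutputCol("features")
  val lr      = new LogisticRegression().setMaxIter(10)

  val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))   // chain everything
  val model    = pipeline.fit(training)      // fit() the Estimator stages in order
  val scored   = model.transform(newData)    // transform() / predict on new input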
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
$1 Million Netflix Prize
October 2006 to September 2009 (3 years!)
Winning algorithm beat Netflix’s Cinematch baseline by 10.06% on RMSE
Ensemble of 500+ models
Combined using Gradient Boosted Decision Trees
Computationally intensive and impractical
Winning Algorithm Adjustments
“Alice effect”: Alice tends to rate lower than the average user
“Inception effect”: Inception is rated higher than the average movie
“Alice-Inception effect”: The interaction of the Alice and Inception effects
Number of days since a user’s first rating
Number of days since a movie’s first rating
Number of people who have rated a movie
A movie’s overall mean rating
Factor these out and find the baseline!
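In the standard notation, the baseline these adjustments factor out can be sketched as (with \mu the global mean rating, b_u the user bias such as the Alice effect, and b_i the item bias such as the Inception effect):

  \hat{r}_{ui} = \mu + b_u + b_i

The factorization then only has to model the residual r_{ui} - \hat{r}_{ui}.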
Thanks!
Chris Fregly
@cfregly
References
①  https://github.com/fluxcapacitor/pipeline
②  http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
③  http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
④  http://spark.apache.org/docs/latest/ml-guide.html
