Practical Data Science Workshop - Recommendation Systems - Collaborative Filtering - Strata NY - 2015

Chris Fregly
Practical Data Science
on Spark & Hadoop
Collaborative Filtering
Recommendation Systems
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
Outline
①  Introduction
②  Live, Interactive, Group Demo!
③  Approximations
④  Similarity
⑤  Recommendations
⑥  Building a Model
⑦  ML Pipelines
⑧  $1 Million Netflix Prize
Who am I?
Live, Interactive, Group Demo!
①  Navigate to sparkafterdark.com
②  Select 3 actresses and 3 actors
③  Wait for me to build the models
https://github.com/fluxcapacitor/pipeline
③  Approximations
Bloom Filter
Approximate set
k-hashes on put/get
False positives
Used all through Spark
From Twitter’s Algebird
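The put/get behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Algebird's or Spark's implementation: the class name `BloomFilter`, the method names, the bit-array size, and the use of salted SHA-256 to derive the k hash positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Approximate set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def put(self, item):
        # Set all k bits for this item.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True if all k bits are set; other items may have set them too.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for name in ["alice", "bob", "carol"]:
    bf.put(name)
```

An item that was put always tests positive. An item that was never put usually tests negative, but can collide on all k bits, which is the false-positive case.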
Count Min Sketch
Approximate counters
Better than HashMap
Low, fixed memory
Known error bounds
Large number of counters
From Twitter’s Algebird
Streaming example in Spark codebase
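A rough sketch of the idea, assuming a small fixed table of `depth` rows by `width` counters. The class and parameter names are hypothetical and the hashing scheme (salted SHA-256) is chosen only for readability; it is not the Algebird implementation.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in fixed memory: depth * width integers."""

    def __init__(self, depth=4, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, simulated by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate; it never undercounts.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


cms = CountMinSketch()
for word in ["spark"] * 5 + ["hadoop"] * 2:
    cms.add(word)
```

Memory stays at `depth * width` counters no matter how many distinct items are added, which is the trade-off against an exact HashMap.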
HyperLogLog
Approximate cardinality
Approximate count distinct
Low memory
1.5KB @ 2% error
10^9 elements!
From Twitter’s Algebird
Streaming example in Spark codebase
countApproxDistinctByKey()
Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials
Converge on expected value
SparkPi example in Spark codebase
Pi ≈ (# red dots / # total dots) * 4
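The estimate can be reproduced without Spark. The sketch below assumes the "red dots" are the random points that land inside the unit quarter-circle; the function name and the fixed seed are choices made for this example, not taken from the SparkPi source.

```python
import random


def estimate_pi(num_trials, seed=42):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point is a "red dot"
            inside += 1
    # Area of quarter circle / area of unit square = pi / 4.
    return 4.0 * inside / num_trials


pi_estimate = estimate_pi(100_000)
```

By the Law of Large Numbers the estimate tightens as `num_trials` grows; the error shrinks roughly with the square root of the number of trials.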
Demo!
Monte Carlo Simulation
④  Similarity
Euclidean Similarity
Linear measure
Bias toward magnitude
Cosine Similarity
Angle measure
Corrects magnitude bias
Jaccard Similarity
Set Intersection divided by Set Union
Bias towards popularity
Log Likelihood Similarity
Corrects popularity bias
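Three of these measures are short enough to write out. The sketch below uses plain Python lists and sets; the function names and the two sample users are illustrative assumptions. It shows the magnitude bias directly: the two users have the same taste in different volume, so the Euclidean distance is large while the cosine similarity is 1.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with differences in magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(s, t):
    """Size of the set intersection divided by the size of the set union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


light_user = [1.0, 2.0, 0.0]
heavy_user = [10.0, 20.0, 0.0]  # same proportions, ten times the activity
```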
Calculating Similarity
“All-pairs similarity”
“Pair-wise similarity”
“Similarity join”
Naïve impl: O(m*n^2); m=rows, n=cols
Must minimize shuffle and computation
Minimizing Shuffle and Computation
Approximate!
Reduce m (rows)
Sampling
Bucketing (a.k.a. “Partitioning” or “Clustering”)
Removing rows with too few non-zero values (i.e. inactive)
Reduce n (cols)
Remove most frequent value (i.e. 0)
Remove least popular
Reduce m (rows): Sampling
DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove rows with low probability of similarity
RowMatrix.columnSimilarities()
Twitter: 40% efficiency gain over naïve cosine similarity
Reduce m (rows): Bucketing
LSH
“Locality Sensitive Hashing”
Split m into b buckets w/ similarity hash func()
Requires pre-processing
Compare items within buckets
Comparison is parallelizable
O(m*n^2) -> O(m*n/b*b^2)
O(1.25E17) -> O(1.25E13); b=50
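One possible "similarity hash func()" is random-hyperplane hashing, which suits cosine similarity. The slides do not say which LSH family was used, so treat this as an assumed example; the function names, the number of hyperplanes, and the sample rows are all hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dim, seed=0):
    """Pre-processing: draw random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def bucket_key(vector, hyperplanes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0.0 for plane in hyperplanes
    )


def bucketize(rows, hyperplanes):
    """Group row IDs by hash so only rows sharing a bucket get compared."""
    buckets = defaultdict(list)
    for row_id, vector in rows.items():
        buckets[bucket_key(vector, hyperplanes)].append(row_id)
    return buckets


rows = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [-1.0, -2.0, -3.0]}
planes = make_hyperplanes(num_planes=4, dim=3)
buckets = bucketize(rows, planes)
```

Rows `a` and `b` point in the same direction and always share a bucket, while `c` points the opposite way and lands elsewhere. Each bucket can then be compared independently, which is what makes the comparison parallelizable.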
Reduce n (cols)
Remove most frequent values
Replace with (index,value) pairs
O(m*n^2) -> O(m*nnz^2); nnz=number of non-zeros
Be sure to choose most frequent value – may not be 0!
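A small sketch of the (index,value) idea, assuming 0 is the most frequent value in this particular data. The helper names `to_sparse` and `sparse_dot` are made up for the example.

```python
def to_sparse(row, most_frequent=0):
    """Keep only (index, value) pairs that differ from the most frequent value."""
    return [(i, v) for i, v in enumerate(row) if v != most_frequent]


def sparse_dot(a, b):
    # Valid when the dropped value is 0: only indices present in both
    # sparse rows contribute to the dot product.
    b_lookup = dict(b)
    return sum(v * b_lookup[i] for i, v in a if i in b_lookup)


dense_a = [0, 0, 3, 0, 5, 0, 0, 0]
dense_b = [0, 2, 4, 0, 0, 0, 0, 1]
sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)
```

The work per pair of rows now depends on the number of stored pairs rather than on the full column count n.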
⑤  Recommendations
Recommendation/ML Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like or rating
Implicit User Feedback: search, click, hover, view, scroll
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations, etc)
Model Evaluation: Compare predictions to actual values of hold out split
Features
Dimensions: Alias for Features
Binary Features: True or False
Numeric Discrete Features: Integers
Numeric Features: Real values
Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL)
Temporal Features: Time-based (Time of Day, Binge Watching)
Categorical Features: Finite, unique set of categories (NFL teams)
Feature Engineering: Modify, reduce, combine features
Feature Engineering
Dimension Reduction: Reduce num features or “feature space”
Principal Component Analysis (PCA): Find principal components that describe the data
One-Hot Encoding: Convert categorical feature vals to 0’s, 1’s
Bears    -> 1  -->  Bears    -> 1,0,0
49’ers   -> 2  -->  49’ers   -> 0,1,0
Steelers -> 3  -->  Steelers -> 0,0,1
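The encoding in the table above amounts to one small function. This is a plain-Python sketch with a hypothetical `one_hot` helper, not a library encoder.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


teams = ["Bears", "49ers", "Steelers"]
encoded = {team: one_hot(team, teams) for team in teams}
```

Integer codes (1, 2, 3) imply an order and a distance between teams that does not exist; the one-hot vectors are all equally far apart.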
Non-Personalized Recommendations
“Cold Start” Problem
Top K Aggregations
Summary Statistics
PageRank
Facebook Graph
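A Top K aggregation needs no user history for the person being served, which is why it helps with cold start. A minimal sketch, assuming the feedback arrives as (user, item) pairs; the function name and sample events are invented for illustration.

```python
from collections import Counter


def top_k_items(events, k):
    """Return the k most frequently chosen items across all users."""
    counts = Counter(item for _user, item in events)
    return [item for item, _count in counts.most_common(k)]


events = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
]
popular = top_k_items(events, k=2)
```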
Demo!
Top K Aggregations
PageRank
Personalized Recommendations
Collaborative Filtering
User-to-Item
Item-to-Item
Clustering (Similarity)
Users
Items
User-to-Item Collaborative Filtering
Find similar users based on similarity function(s)
Cosine similarity, etc
Recommend items that other similar users have chosen
Exclude items that have already been chosen
Rank items by num of similar users who have chosen
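The four steps above can be sketched as follows. This is a toy version on binary "chosen / not chosen" data, using cosine similarity over item sets; the function names, the neighbor count, and the sample users are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-empty sets treated as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(target, choices, num_neighbors=2):
    """choices maps each user to the set of items that user has chosen."""
    chosen = choices[target]
    # 1. Find the users most similar to the target.
    neighbors = sorted(
        (user for user in choices if user != target),
        key=lambda user: cosine(chosen, choices[user]),
        reverse=True,
    )[:num_neighbors]
    # 2. Collect neighbors' items, excluding items the target already chose.
    votes = {}
    for user in neighbors:
        for item in choices[user] - chosen:
            votes[item] = votes.get(item, 0) + 1
    # 3. Rank by number of similar users who chose the item (ties by name).
    return sorted(votes, key=lambda item: (-votes[item], item))


choices = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "cat": {"A", "C", "D", "E"},
    "dan": {"X", "Y"},
}
recs = recommend("ann", choices)
```

Here `bob` and `cat` are the nearest neighbors of `ann`; both chose `D` and only `cat` chose `E`, so `D` ranks first.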
Alternating Least Squares (ALS)
Matrix Factorization
Item-to-Item Collaborative Filtering
Made famous by Amazon ~2003
Couldn’t scale traditional User-to-Item algos
Offline: Generates ItemID::List[CustomerID] vectors
Online: For each item in shopping cart, find similar
items based on closest List[CustomerID] vector
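The offline/online split can be illustrated on a tiny purchase log. This sketch is an assumption-laden simplification: it treats each List[CustomerID] as a set, compares items by cosine similarity, and uses invented function names and data. It is not Amazon's implementation.

```python
import math
from collections import defaultdict


def build_item_vectors(purchases):
    """Offline step: map each item ID to the set of customers who bought it."""
    vectors = defaultdict(set)
    for customer, item in purchases:
        vectors[item].add(customer)
    return vectors


def cosine(a, b):
    """Cosine similarity between two non-empty customer sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def similar_items(item, vectors, k=1):
    """Online step: rank the other items by closeness of their customer sets."""
    others = [other for other in vectors if other != item]
    others.sort(key=lambda other: (-cosine(vectors[item], vectors[other]), other))
    return others[:k]


purchases = [
    ("c1", "book"), ("c2", "book"), ("c3", "book"),
    ("c1", "lamp"), ("c2", "lamp"),
    ("c3", "desk"), ("c4", "desk"),
]
item_vectors = build_item_vectors(purchases)
```

Calling `similar_items("book", item_vectors)` returns the item whose buyers overlap most with the buyers of `book`. The expensive vector building happens offline, so the online lookup per cart item stays cheap.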
User and Item Clustering (Similarity)
Based on Similarity
i.e. Similar Profile/Description Text or Categories
LDA Topic, K-Means, Nearest Neighbor, Eigenfaces, PCA
Streaming K Means Clustering
Initial set of k clusters with random centers
Incoming data:
Assign to closest cluster: distance to center
Update centers: minimize within-cluster-sum-of-squares
Half-life decay factor
Reduce contribution of old data to half
Measured in num batches or num data points
Eliminate dead clusters never assigned new data
Split existing cluster and join with dead cluster
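The assign-then-update loop with a decay factor can be sketched in one dimension. This is a simplified illustration under stated assumptions: points are scalars, `decay` is applied once per batch, and dead-cluster handling is left out. The function names are hypothetical and this is not the Spark Streaming implementation.

```python
def assign(point, centers):
    """Index of the closest center (1-D distance for brevity)."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def update(centers, weights, batch, decay):
    """Fold one batch into the clusters, discounting old data by `decay`."""
    assigned = [[] for _ in centers]
    for point in batch:
        assigned[assign(point, centers)].append(point)
    new_centers, new_weights = [], []
    for center, weight, points in zip(centers, weights, assigned):
        old = weight * decay  # discounted weight of everything seen so far
        total = old + len(points)
        if total > 0:
            # Weighted mean of the old center and the newly assigned points.
            center = (center * old + sum(points)) / total
        new_centers.append(center)
        new_weights.append(total)
    return new_centers, new_weights


centers, weights = update([0.0, 10.0], [4.0, 4.0], batch=[1.0, 2.0, 9.0], decay=0.5)
```

With `decay=1.0` all history counts equally; with `decay=0.0` each batch replaces the past entirely. Values in between let the centers drift toward recent data.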
Demo!
Alternating Least Squares
Matrix Factorization
⑥  Building a Model
Split Instance Data
3 Roles
Model Training (80%)
Model Validation (10%)
Model Testing (10%)
k-folds Cross Validation
Divide instances into k sections
Alternate each k section between 3 roles above
http://www.slideshare.net/SebastianRaschka/musicmood-20140912
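A minimal sketch of the 80/10/10 split and of k-fold rotation, using only the standard library. The function names are assumptions, and the k-fold helper is simplified to two roles (train and held-out) rather than the three roles listed on the slide.

```python
import random


def split_instances(instances, seed=0):
    """Shuffle, then cut into training (80%), validation (10%), testing (10%)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 8 // 10
    valid_end = n * 9 // 10
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


def k_folds(instances, k):
    """Yield (train, held_out) pairs, holding out each of the k sections once."""
    folds = [instances[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


train, valid, test = split_instances(range(100))
folds = list(k_folds(list(range(10)), k=5))
```

Every instance is held out exactly once across the k rounds, so the averaged score uses all of the data without ever scoring a model on instances it was trained on.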
Hyperparameter Selection
Select sets of values for each hyperparameter
Use GridSearch to find best combo to reduce error
Avoid overfitting!
http://www.slideshare.net/ogrisel/strategies-and-tools-for-parallel-machine-learning-in-python
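Grid search is an exhaustive loop over every combination of candidate values. In the sketch below the scoring function is a hypothetical stand-in for "train on the training split, measure error on the validation split"; the parameter names `rank` and `reg` are likewise only examples.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Return the parameter combination with the lowest validation error."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = score_fn(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_validation_error(params):
    # Hypothetical error surface with its minimum at rank=10, reg=0.1.
    return (params["rank"] - 10) ** 2 + (params["reg"] - 0.1) ** 2


grid = {"rank": [5, 10, 20], "reg": [0.01, 0.1, 1.0]}
best_params, best_error = grid_search(grid, toy_validation_error)
```

Scoring on the validation split rather than the training split is what guards against overfitting here; the test split stays untouched until the final evaluation.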
Evaluation Criteria
Regression (Distance has meaning)
Root Mean Square Error (RMSE)
Mean Absolute Error (MAE)
Categorical (Distance does not have meaning)
Precision/Accuracy
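The regression metrics are one-liners, and accuracy for the categorical case is just the fraction of exact matches. The function names and sample ratings below are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root Mean Square Error: penalizes large misses more heavily."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def mae(predicted, actual):
    """Mean Absolute Error: average size of a miss."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def accuracy(predicted, actual):
    """Fraction of categorical predictions that match exactly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


predicted_ratings = [3.0, 4.0, 5.0, 2.0]
actual_ratings = [3.0, 5.0, 3.0, 2.0]
```

For these ratings the misses are 0, 1, 2 and 0, so MAE is 0.75 while RMSE is about 1.12, because the single miss of 2 is squared.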
ML Pipelines
Inspired by scikit-learn
Transformers
transform() input for estimation (training)
predict() new input
Estimators
fit() a model to the transformed dataset (training)
Pipeline
Chain everything together
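The transformer / estimator / pipeline pattern can be shown in plain Python. The classes below are invented for illustration and are not the Spark ML or scikit-learn API; they only mirror the method names on the slide (`transform()`, `fit()`, `predict()`).

```python
class Scale:
    """Transformer: divides every input value by a constant."""

    def __init__(self, divisor):
        self.divisor = divisor

    def transform(self, xs):
        return [x / self.divisor for x in xs]


class SlopeModel:
    """Fitted model: predicts y = slope * x."""

    def __init__(self, slope):
        self.slope = slope

    def predict(self, xs):
        return [self.slope * x for x in xs]


class SlopeEstimator:
    """Estimator: least-squares fit of a line through the origin."""

    def fit(self, xs, ys):
        slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return SlopeModel(slope)


class Pipeline:
    """Chains transformers in front of an estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator
        self.model = None

    def _prepare(self, xs):
        for transformer in self.transformers:
            xs = transformer.transform(xs)
        return xs

    def fit(self, xs, ys):
        self.model = self.estimator.fit(self._prepare(xs), ys)
        return self

    def predict(self, xs):
        # New input goes through the same transformers used in training.
        return self.model.predict(self._prepare(xs))


pipeline = Pipeline([Scale(10.0)], SlopeEstimator())
pipeline.fit([10.0, 20.0, 30.0], [2.0, 4.0, 6.0])
predictions = pipeline.predict([40.0])
```

The point of chaining is that training data and new data are guaranteed to pass through identical preprocessing.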
$1 Million Netflix Prize
October, 2006 --> Sept 2009 (3 years!!)
Winning algorithm beat Netflix by 10.06% based on RMSE
Ensemble of 500+ models
Combined using Gradient Boosted Decision Trees
Computationally intensive and impractical
Winning Algorithm Adjustments
“Alice effect”: Alice tends to rate lower than the average user
“Inception effect”: Inception is rated higher than the average movie
“Alice-Inception effect”: Combo of Alice and Inception
Number of days since a user’s first rating
Number of days since a movie’s first rating
Number of people who have rated a movie
A movie’s overall mean rating
Factor these out and find the baseline!
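A simplified sketch of factoring out the user and movie effects: the baseline prediction is the global mean plus a per-user offset plus a per-movie offset. The offsets here are plain averages of deviations from the global mean, which is a simplification and not the winning team's fitting procedure; the function names and ratings are invented.

```python
def fit_baseline(ratings):
    """ratings: list of (user, movie, rating) tuples."""
    global_mean = sum(rating for _, _, rating in ratings) / len(ratings)

    def offsets(index):
        # Average deviation from the global mean, grouped by user or movie.
        deviations = {}
        for row in ratings:
            deviations.setdefault(row[index], []).append(row[2] - global_mean)
        return {key: sum(vals) / len(vals) for key, vals in deviations.items()}

    return global_mean, offsets(0), offsets(1)


def predict_baseline(user, movie, global_mean, user_offset, movie_offset):
    # Unknown users or movies fall back to an offset of zero.
    return global_mean + user_offset.get(user, 0.0) + movie_offset.get(movie, 0.0)


ratings = [
    ("alice", "Inception", 4.0), ("alice", "Titanic", 2.0),
    ("bob", "Inception", 5.0), ("bob", "Titanic", 4.0),
]
global_mean, user_offset, movie_offset = fit_baseline(ratings)
```

In this toy data Alice's offset is negative (she rates low) and Inception's is positive (it is rated high). What remains after subtracting the baseline is the part a collaborative filtering model still has to explain.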
Thanks!
Chris Fregly
@cfregly
References
①  https://github.com/fluxcapacitor/pipeline
②  http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
③  http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
④  http://spark.apache.org/docs/latest/ml-guide.html
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Chris Fregly597 views
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -... by Chris Fregly
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
Chris Fregly1.1K views
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer... by Chris Fregly
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
Chris Fregly607 views
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ... by Chris Fregly
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Chris Fregly5.3K views
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to... by Chris Fregly
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
Chris Fregly2.5K views
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern... by Chris Fregly
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Chris Fregly963 views
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -... by Chris Fregly
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly3.9K views
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +... by Chris Fregly
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
Chris Fregly1.4K views
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S... by Chris Fregly
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
Chris Fregly2.5K views

Recently uploaded

Airline Booking Software by
Airline Booking SoftwareAirline Booking Software
Airline Booking SoftwareSharmiMehta
9 views26 slides
Introduction to Gradle by
Introduction to GradleIntroduction to Gradle
Introduction to GradleJohn Valentino
5 views7 slides
nintendo_64.pptx by
nintendo_64.pptxnintendo_64.pptx
nintendo_64.pptxpaiga02016
6 views7 slides
Programming Field by
Programming FieldProgramming Field
Programming Fieldthehardtechnology
6 views9 slides
Bootstrapping vs Venture Capital.pptx by
Bootstrapping vs Venture Capital.pptxBootstrapping vs Venture Capital.pptx
Bootstrapping vs Venture Capital.pptxZeljko Svedic
15 views17 slides
Top-5-production-devconMunich-2023-v2.pptx by
Top-5-production-devconMunich-2023-v2.pptxTop-5-production-devconMunich-2023-v2.pptx
Top-5-production-devconMunich-2023-v2.pptxTier1 app
6 views42 slides

Recently uploaded(20)

Airline Booking Software by SharmiMehta
Airline Booking SoftwareAirline Booking Software
Airline Booking Software
SharmiMehta9 views
Bootstrapping vs Venture Capital.pptx by Zeljko Svedic
Bootstrapping vs Venture Capital.pptxBootstrapping vs Venture Capital.pptx
Bootstrapping vs Venture Capital.pptx
Zeljko Svedic15 views
Top-5-production-devconMunich-2023-v2.pptx by Tier1 app
Top-5-production-devconMunich-2023-v2.pptxTop-5-production-devconMunich-2023-v2.pptx
Top-5-production-devconMunich-2023-v2.pptx
Tier1 app6 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy14 views
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by animuscrm
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
animuscrm15 views
How Workforce Management Software Empowers SMEs | TraQSuite by TraQSuite
How Workforce Management Software Empowers SMEs | TraQSuiteHow Workforce Management Software Empowers SMEs | TraQSuite
How Workforce Management Software Empowers SMEs | TraQSuite
TraQSuite6 views
Understanding HTML terminology by artembondar5
Understanding HTML terminologyUnderstanding HTML terminology
Understanding HTML terminology
artembondar57 views
tecnologia18.docx by nosi6702
tecnologia18.docxtecnologia18.docx
tecnologia18.docx
nosi67025 views
AI and Ml presentation .pptx by FayazAli87
AI and Ml presentation .pptxAI and Ml presentation .pptx
AI and Ml presentation .pptx
FayazAli8714 views
Ports-and-Adapters Architecture for Embedded HMI by Burkhard Stubert
Ports-and-Adapters Architecture for Embedded HMIPorts-and-Adapters Architecture for Embedded HMI
Ports-and-Adapters Architecture for Embedded HMI
Burkhard Stubert29 views
predicting-m3-devopsconMunich-2023.pptx by Tier1 app
predicting-m3-devopsconMunich-2023.pptxpredicting-m3-devopsconMunich-2023.pptx
predicting-m3-devopsconMunich-2023.pptx
Tier1 app8 views
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with... by sparkfabrik
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
sparkfabrik8 views
Quality Engineer: A Day in the Life by John Valentino
Quality Engineer: A Day in the LifeQuality Engineer: A Day in the Life
Quality Engineer: A Day in the Life
John Valentino7 views
JioEngage_Presentation.pptx by admin125455
JioEngage_Presentation.pptxJioEngage_Presentation.pptx
JioEngage_Presentation.pptx
admin1254558 views

Practical Data Science Workshop - Recommendation Systems - Collaborative Filtering - Strata NY - 2015

  • 1. Practical Data Science on Spark & Hadoop: Collaborative Filtering Recommendation Systems. Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center
  • 2. Outline ①  Introduction ②  Live, Interactive, Group Demo! ③  Approximations ④  Similarity ⑤  Recommendations ⑥  Building a Model ⑦  ML Pipelines ⑧  $1 Million Netflix Prize
  • 5. Live, Interactive, Group Demo! ① Navigate to sparkafterdark.com ② Select 3 actresses and 3 actors ③ Wait for me to build the models. Source: https://github.com/fluxcapacitor/pipeline
  • 7. Bloom Filter: approximate set; k hashes on put/get; false positives are possible; used all through Spark; from Twitter’s Algebird
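The Bloom filter bullets can be illustrated with a minimal sketch. This is not Algebird's or Spark's implementation; the class name, the default sizes, and the use of salted SHA-256 to get k hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Approximate set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes (k) are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def put(self, item):
        # Set all k bits for the item.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
for name in ["actor-1", "actor-2", "actor-3"]:
    bloom.put(name)
```

An item that was put always tests positive; an item that was never put usually tests negative, but can collide on all k bits and come back as a false positive.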
  • 8. Count-Min Sketch: approximate counters; better than a HashMap; low, fixed memory; known error bounds; suited to a large number of counters; from Twitter’s Algebird; streaming example in the Spark codebase
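A small sketch of the Count-Min idea, under the same caveats: this is an illustration, not Algebird's code, and the width, depth, and salted SHA-256 hashing are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates can overcount but never undercount."""

    def __init__(self, width=256, depth=4):
        # Memory is width * depth counters regardless of how many keys are seen.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, simulated by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows is the best estimate.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))


sketch = CountMinSketch()
for item in ["a", "b", "a", "c", "a"]:
    sketch.add(item)
```

Here `sketch.estimate("a")` is at least 3, and exactly 3 unless "a" collides with another key in every row.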
  • 9. HyperLogLog: approximate cardinality (approximate count distinct); low memory: 1.5KB @ 2% error for 10^9 elements!; from Twitter’s Algebird; streaming example in the Spark codebase; countApproxDistinctByKey()
  • 10. Monte Carlo Simulations: from the Manhattan Project (A-bomb), where they simulated the movement of neutrons; Law of Large Numbers (LLN): the average of the results of many trials converges on the expected value; SparkPi example in the Spark codebase: Pi ≈ (# red dots / # total dots) * 4
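The Pi estimate can be sketched on a single machine as follows. This is not the SparkPi source; the function name and the fixed seed are assumptions. Points are thrown uniformly at the unit square, and the "red dots" are the ones that land inside the quarter circle of radius 1.

```python
import random


def estimate_pi(num_trials, seed=42):
    """Estimate Pi as 4 * (# points inside the quarter circle / # total points)."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    inside = 0
    for _ in range(num_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_trials


print(estimate_pi(200_000))
```

By the Law of Large Numbers the estimate tightens as `num_trials` grows; the error shrinks roughly with the square root of the number of trials.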
  • 15. Jaccard Similarity: set intersection divided by set union; bias towards popularity
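As a minimal illustration of the definition (the function name and the empty-set convention are assumptions):

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets have similarity 0
    return len(a & b) / len(union)


# Two users who share 2 of the 4 distinct items they chose between them.
print(jaccard_similarity({"m1", "m2", "m3"}, {"m2", "m3", "m4"}))  # 0.5
```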
  • 17. Calculating Similarity: also known as “all-pairs similarity”, “pair-wise similarity”, or “similarity join”; naïve impl is O(m*n^2), where m=rows, n=cols; must minimize shuffle and computation
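To make the O(m*n^2) cost concrete, here is a naive single-machine sketch of all-pairs column similarity using cosine similarity. The function names and the dense list-of-rows layout are assumptions; every pair of the n columns is compared, and each comparison touches all m rows.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def all_pairs_column_similarity(rows):
    """Compare every pair of columns: about n^2 / 2 pairs, each costing O(m)."""
    n = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n)]
    return {(i, j): cosine(columns[i], columns[j]) for i, j in combinations(range(n), 2)}


sims = all_pairs_column_similarity([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
```

In this toy matrix columns 0 and 2 are identical, so their similarity is 1 (up to floating-point rounding), while column 1 shares nothing with either.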
  • 18. Minimizing Shuffle and Computation: Approximate! Reduce m (rows): sampling; bucketing (aka “partitioning” or “clustering”); removing rows with sparsity below threshold (i.e. inactive). Reduce n (cols): remove most frequent value (i.e. 0); remove least popular
  • 19. Reduce m (rows): Sampling. DIMSUM (“Dimension Independent Matrix Square Using MR”); remove rows with low probability of similarity; RowMatrix.columnSimilarities(); Twitter: 40% efficiency gain over naïve cosine similarity
  • 20. Reduce m (rows): Bucketing. LSH (“Locality Sensitive Hashing”); split m into b buckets w/ similarity hash func(); requires pre-processing; compare items within buckets; comparison is parallelizable; O(m*n^2) -> O(m*n/b*b^2); O(1.25E17) -> O(1.25E13) for b=50
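One common way to get a "similarity hash func()" for sets is MinHash with banding. The slide does not say which LSH family the talk used, so MinHash is an assumption here, as are the function names, the salted SHA-256 hashing, and the band sizes. Only items that share a bucket become candidate pairs.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes=8):
    """One minimum per salted hash function; assumes `items` is non-empty."""
    return tuple(
        min(int(hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest(), 16) for item in items)
        for i in range(num_hashes)
    )


def bucket_by_band(sets_by_id, num_hashes=8, rows_per_band=2):
    """Pre-processing step: hash each band of the signature into a bucket."""
    buckets = defaultdict(set)
    for key, items in sets_by_id.items():
        signature = minhash_signature(items, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            buckets[(start, signature[start:start + rows_per_band])].add(key)
    return buckets


def candidate_pairs(buckets):
    """Only members of the same bucket are compared; buckets are independent of each other."""
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


profiles = {"u1": {"a", "b", "c"}, "u2": {"a", "b", "c"}, "u3": {"x", "y", "z"}}
pairs = candidate_pairs(bucket_by_band(profiles))
```

Identical sets always share every band, so `("u1", "u2")` is a candidate pair, while the disjoint set `u3` never shares a bucket with them and is never compared.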
  • 21. Reduce n (cols): remove most frequent values; replace with (index,value) pairs; O(m*n^2) -> O(m*nnz^2), where nnz=number of non-zeros; be sure to choose the most frequent value, which may not be 0!
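A minimal sketch of the (index,value) encoding, assuming a plain Python list as the dense row and a hypothetical helper name. The most frequent value is detected rather than assumed to be 0.

```python
from collections import Counter


def to_sparse(row):
    """Return (most_frequent_value, [(index, value), ...]) for all other entries."""
    default, _ = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != default]
    return default, pairs


print(to_sparse([0, 0, 5, 0, 3]))  # (0, [(2, 5), (4, 3)])
print(to_sparse([1, 1, 1, 0, 1]))  # (1, [(3, 0)])
```

The second call shows the case the slide warns about: the most frequent value is 1, so storing only the non-zeros would save nothing there.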
  • 23. Recommendation/ML Terminology. User: user seeking recommendations; Item: item being recommended; Explicit User Feedback: like or rating; Implicit User Feedback: search, click, hover, view, scroll; Instances: rows of user feedback/input data; Overfitting: training a model too closely to the training data & hyperparameters; Hold Out Split: holding out some of the instances to avoid overfitting; Features: columns of instance rows (of feedback/input data); Cold Start Problem: not enough data to personalize (new); Hyperparameter: model-specific config knobs for tuning (tree depth, iterations, etc); Model Evaluation: compare predictions to actual values of hold out split
  • 24. Features. Dimensions: alias for features; Binary Features: true or false; Numeric Discrete Features: integers; Numeric Features: real values; Ordinal Features: maintain order (S -> M -> L -> XL -> XXL); Temporal Features: time-based (time of day, binge watching); Categorical Features: finite, unique set of categories (NFL teams); Feature Engineering: modify, reduce, combine features
  • 25. Feature Engineering. Dimension Reduction: reduce num features or “feature space”; Principal Component Analysis (PCA): find principal features that describe the data; One-Hot Encoding: convert categorical feature vals to 0’s, 1’s, so that Bears -> 1 becomes Bears -> 1,0,0; 49’ers -> 2 becomes 49’ers -> 0,1,0; Steelers -> 3 becomes Steelers -> 0,0,1
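The one-hot mapping for the NFL example can be sketched like this. The function name is hypothetical, and categories are ordered by first appearance to match the slide.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1."""
    categories = list(dict.fromkeys(values))  # unique values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    vectors = [[1 if index[v] == i else 0 for i in range(len(categories))] for v in values]
    return categories, vectors


categories, vectors = one_hot_encode(["Bears", "49'ers", "Steelers", "Bears"])
print(categories)  # ['Bears', "49'ers", 'Steelers']
print(vectors)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Unlike the integer codes 1, 2, 3, the encoded vectors do not imply that Steelers is "greater than" Bears.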
  • 26. Non-Personalized Recommendations: “Cold Start” problem; Top K aggregations; summary statistics; PageRank; Facebook Graph
  • 29. User-to-Item Collaborative Filtering: find similar users based on similarity function(s) (cosine similarity, etc); recommend items that other similar users have chosen; exclude items that have already been chosen; rank items by num of similar users who have chosen them; Alternating Least Squares Matrix Factorization
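The neighbourhood steps listed on this slide (not the ALS matrix factorization) can be sketched as below. The function names, the similarity threshold, and the toy data are assumptions; cosine similarity is computed on sets of chosen items.

```python
import math
from collections import Counter


def set_cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(target, choices_by_user, min_similarity=0.3, top_k=3):
    chosen = choices_by_user[target]
    votes = Counter()
    for user, items in choices_by_user.items():
        if user == target:
            continue
        if set_cosine(chosen, items) >= min_similarity:
            # Count only items the target has not already chosen.
            votes.update(items - chosen)
    # Rank by the number of similar users who chose each item.
    return [item for item, _ in votes.most_common(top_k)]


choices = {
    "alice": {"A", "B"},
    "bob": {"A", "B", "C"},
    "carol": {"A", "C", "D"},
    "dave": {"X"},
}
print(recommend("alice", choices))  # ['C', 'D']
```

Bob and Carol are both similar enough to Alice; C gets two votes and D one, while Dave's choice X is ignored because he shares nothing with Alice.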
  • 31. Item-to-Item Collaborative Filtering: made famous by Amazon ~2003; couldn’t scale traditional User-to-Item algos; Offline: generates ItemID::List[CustomerID] vectors; Online: for each item in shopping cart, find similar items based on closest List[CustomerID] vector
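A toy sketch of the offline/online split, assuming purchases arrive as (customer, item) pairs. This is an illustration of the idea, not Amazon's algorithm; all names and the cosine-on-sets similarity are assumptions.

```python
import math
from collections import defaultdict


def build_item_vectors(purchases):
    """Offline: item -> set of customers who bought it."""
    vectors = defaultdict(set)
    for customer, item in purchases:
        vectors[item].add(customer)
    return dict(vectors)


def similar_items(cart, item_vectors, top_k=2):
    """Online: score items outside the cart by their closest match to any cart item."""
    scores = {}
    for cart_item in cart:
        buyers = item_vectors[cart_item]
        for other, customers in item_vectors.items():
            if other in cart:
                continue
            score = len(buyers & customers) / math.sqrt(len(buyers) * len(customers))
            scores[other] = max(scores.get(other, 0.0), score)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_k] if score > 0]


purchases = [
    ("c1", "book"), ("c1", "lamp"), ("c2", "book"), ("c2", "lamp"),
    ("c3", "book"), ("c3", "pen"), ("c4", "mug"),
]
item_vectors = build_item_vectors(purchases)
print(similar_items(["book"], item_vectors))  # ['lamp', 'pen']
```

The expensive part (building the vectors) runs offline; the online lookup only touches items, not the full customer base.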
  • 32. User and Item Clustering (Similarity): based on similarity, i.e. similar profile/description text or categories; LDA Topic, K-Means, Nearest Neighbor, Eigenfaces, PCA
  • 33. Streaming K-Means Clustering: initial set of k clusters with random centers; for incoming data, assign each point to the closest cluster (distance to center), then update centers to minimize within-cluster-sum-of-squares; half-life decay factor: reduces contribution of old data to half, measured in num batches or num data points; eliminate dead clusters never assigned new data: split an existing cluster and join with the dead cluster
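One batch update with a decay factor can be sketched as follows. This is a simplified illustration and not Spark's StreamingKMeans code: the function names are hypothetical, dead-cluster handling is omitted, and the decay is derived from a half-life measured in batches.

```python
def decay_from_half_life(half_life):
    """Decay factor such that old data counts for half after `half_life` batches."""
    return 0.5 ** (1.0 / half_life)


def update_centers(centers, weights, batch, decay):
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in batch:
        # Assign each incoming point to the closest center (squared distance).
        nearest = min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    new_centers, new_weights = [], []
    for c in range(k):
        old = weights[c] * decay  # discount the weight of everything seen so far
        total = old + counts[c]
        if total == 0:
            new_centers.append(list(centers[c]))
        else:
            new_centers.append([(centers[c][d] * old + sums[c][d]) / total for d in range(dim)])
        new_weights.append(total)
    return new_centers, new_weights


centers, weights = update_centers([[0.0], [10.0]], [1.0, 1.0], [[2.0], [12.0]], decay=0.5)
```

With `decay=0.5` the old center counts half as much as the single new point, so the first center moves from 0 to about 1.33 and the second from 10 to about 11.33.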
  • 36. Split Instance Data. 3 roles: Model Training (80%), Model Validation (10%), Model Testing (10%); k-folds Cross Validation: divide instances into k sections, then alternate each of the k sections between the 3 roles above. Source: http://www.slideshare.net/SebastianRaschka/musicmood-20140912
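A plain-Python sketch of the 80/10/10 split and of k-fold rotation. The function names and the fixed shuffle seed are assumptions, and the k-fold helper shows the common two-role variant in which each section takes one turn as the held-out set.

```python
import random


def split_instances(instances, seed=0):
    """Shuffle, then split into training (80%), validation (10%), testing (10%)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_validation = len(shuffled) // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


def k_folds(instances, k):
    """Yield (training, held_out) pairs; each of the k sections is held out once."""
    folds = [instances[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


train, validation, test = split_instances(range(100))
folds = list(k_folds(list(range(10)), k=5))
```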
  • 37. Hyperparameter Selection: select sets of values for each hyperparameter; use GridSearch to find the best combo to reduce error; avoid overfitting! Source: http://www.slideshare.net/ogrisel/strategies-and-tools-for-parallel-machine-learning-in-python
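Grid search itself is just an exhaustive loop over the cross product of candidate values. In this sketch the `evaluate` callback and the `rank`/`reg` hyperparameter names are hypothetical; in practice `evaluate` would train a model and return its error on the validation split, not on the training data.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return the one with the lowest error."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_error(params):
    # Stand-in for "train, then measure validation error"; minimum at rank=10, reg=0.1.
    return (params["rank"] - 10) ** 2 + (params["reg"] - 0.1) ** 2


best_params, best_error = grid_search({"rank": [5, 10, 20], "reg": [0.01, 0.1, 1.0]}, toy_error)
print(best_params)  # {'rank': 10, 'reg': 0.1}
```

The number of trials is the product of the list lengths (9 here), which is why grids are kept small or searched in parallel.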
  • 38. Evaluation Criteria. Regression (distance has meaning): Root Mean Square Error (RMSE), Mean Absolute Error (MAE); Categorical (distance does not have meaning): Precision/Accuracy
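The three metrics can be written out directly. The function names are assumptions, and inputs are assumed to be equal-length, non-empty lists.

```python
import math


def rmse(predicted, actual):
    """Root Mean Square Error: penalizes large misses more heavily."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean Absolute Error: average size of the miss."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def accuracy(predicted, actual):
    """Fraction of categorical predictions that match exactly."""
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)


print(rmse([3.0, 5.0], [1.0, 5.0]))                    # sqrt(2), about 1.414
print(mae([3.0, 5.0], [1.0, 5.0]))                     # 1.0
print(accuracy(["a", "b", "a"], ["a", "b", "b"]))      # 2 of 3 correct
```

On the same predictions RMSE (about 1.414) exceeds MAE (1.0) because the single miss of 2 stars is squared before averaging.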
  • 40. ML Pipelines: inspired by scikit-learn; Transformers: transform() input for estimation (training), predict() new input; Estimators: fit() a model to the transformed dataset (training); Pipeline: chain everything together
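The transformer/estimator/pipeline pattern can be sketched in plain Python. None of these classes are the Spark ML or scikit-learn APIs; the class names and the toy nearest-mean model are assumptions chosen to keep the example self-contained.

```python
class MaxAbsScaler:
    """Transformer: learns a scale in fit(), applies it in transform()."""

    def fit(self, rows):
        self.scale = max(abs(v) for row in rows for v in row) or 1.0
        return self

    def transform(self, rows):
        return [[v / self.scale for v in row] for row in rows]


class NearestMeanClassifier:
    """Estimator: fit() stores one mean vector per label; predict() picks the closest."""

    def fit(self, rows, labels):
        grouped = {}
        for row, label in zip(rows, labels):
            grouped.setdefault(label, []).append(row)
        self.means = {
            label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in grouped.items()
        }
        return self

    def predict(self, rows):
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [min(self.means, key=lambda label: distance(row, self.means[label])) for row in rows]


class Pipeline:
    """Chains transformers in front of a final estimator."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, rows, labels):
        for transformer in self.transformers:
            rows = transformer.fit(rows).transform(rows)
        self.estimator.fit(rows, labels)
        return self

    def predict(self, rows):
        # New input goes through the same fitted transformers before prediction.
        for transformer in self.transformers:
            rows = transformer.transform(rows)
        return self.estimator.predict(rows)


pipeline = Pipeline([MaxAbsScaler()], NearestMeanClassifier())
pipeline.fit([[1.0], [2.0], [9.0], [10.0]], ["low", "low", "high", "high"])
print(pipeline.predict([[1.5], [8.0]]))  # ['low', 'high']
```

The point of the chain is that training and prediction are guaranteed to apply the same preprocessing, with parameters learned only from the training data.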
  • 42. $1 Million Netflix Prize: October 2006 to Sept 2009 (3 years!!); winning algorithm beat Netflix by 10.06% based on RMSE; ensemble of 500+ models, combined using Gradient Boosted Decision Trees; computationally intensive and impractical
  • 43. Winning Algorithm Adjustments: “Alice effect”: Alice tends to rate lower than the average user; “Inception effect”: Inception is rated higher than the average movie; “Alice-Inception effect”: combo of Alice and Inception; number of days since a user’s first rating; number of days since a movie’s first rating; number of people who have rated a movie; a movie’s overall mean rating. Factor these out and find the baseline!
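The user and movie effects can be illustrated with a simple baseline predictor: global mean plus a per-user offset plus a per-movie offset. This is a deliberately naive sketch, not the prize-winning method; the function name, the independent averaging of offsets, and the toy ratings are assumptions, and the time-based adjustments are left out.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns a predict(user, movie) function."""
    mean = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append(rating - mean)
        by_movie[movie].append(rating - mean)
    user_effect = {u: sum(d) / len(d) for u, d in by_user.items()}
    movie_effect = {m: sum(d) / len(d) for m, d in by_movie.items()}

    def predict(user, movie):
        # Unknown users or movies fall back to an effect of 0 (cold start).
        return mean + user_effect.get(user, 0.0) + movie_effect.get(movie, 0.0)

    return predict


predict = fit_baseline([
    ("alice", "inception", 4), ("alice", "titanic", 2),
    ("bob", "inception", 5), ("bob", "titanic", 5),
])
print(predict("alice", "inception"))  # 3.5
```

With a global mean of 4.0, Alice rates 1 star below average and Inception sits 0.5 above, giving 4.0 - 1.0 + 0.5 = 3.5. What remains after subtracting such a baseline is the part a collaborative filtering model has to explain.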
  • 44. Thanks! Chris Fregly, @cfregly. References: ① https://github.com/fluxcapacitor/pipeline ② http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf ③ http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/ ④ http://spark.apache.org/docs/latest/ml-guide.html