DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations

Power of data. Simplicity of design. Speed of innovation. IBM Spark, spark.tc
advancedspark.com

Who Am I?
Streaming Data Engineer
Netflix OSS Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced .
Due 2016

Recent World Tour: Freg-a-Palooza!
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Istanbul Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 28th)
Singapore Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
Toronto Spark Meetup (Dec 14th)

Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 5 Most-active Spark Meetup!
2600+ Members in just 6 mos!!
2600+ Docker downloads (demos)
Meetup Mission
Code deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance

Live, Interactive Demo!
Audience Participation Required!!
Cell Phone Compatible!!!
http://demo.advancedspark.com

http://demo.advancedspark.com
[Architecture diagram. Left-hand labels: End User, ElasticSearch, Spark ML, Data Scientist. Right-hand labels: Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython.]

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Scaling with Parallelism
[Figure labels: Peter; O(log n); O(log n); Worker Nodes]

Parallelism with Composability
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)
Worker 1 computes the first parenthesized group, Worker 2 the second
Collect at Driver
What about Division and Average?

What about Division?
Division (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
Not Composable
What were the Egyptians thinking?!
“Divide like an Egyptian”

What about Average?
Overall AVG (values, counts)
(3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
Pairwise AVG
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide?
Not Composable
Single-Node Divide at the End?
Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
Add, Add, Add?
Composable!
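The average example above can be sketched in a few lines of plain Python. This is an illustration, not code from the deck: the `combine` and `partial_sum_count` names and the two-worker split are assumptions made for the example. Each worker carries a (sum, count) pair, pairs are merged with addition only, and the single division happens at the end.

```python
from functools import reduce

def combine(a, b):
    """Associative merge of two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def partial_sum_count(values):
    """What each (hypothetical) worker computes over its own partition."""
    return reduce(combine, ((v, 1) for v in values), (0, 0))

worker1 = partial_sum_count([3, 5])   # (8, 2)
worker2 = partial_sum_count([5, 7])   # (12, 2)

# Driver: add, add, add ... then divide once at the end.
total_sum, total_count = combine(worker1, worker2)
overall_avg = total_sum / total_count          # 5.0

# The non-composable version: divide per pair, then add.
pairwise = (3 + 5) / 2 + (5 + 7) / 2           # 10.0, not the average
```

Because `combine` is associative, the partitions can be merged in any grouping and still give the same (sum, count).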

① Scaling
② Similarities
③ Recommendations
④ Approximations

Similarities

Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude

Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
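A minimal sketch of the magnitude bias, in plain Python rather than jblas. The two user vectors are made-up data: same taste, ten times the activity. Euclidean distance calls them far apart, while cosine similarity sees the same direction.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical rating-count vectors for two users.
light_user = [1.0, 2.0, 1.0]
heavy_user = [10.0, 20.0, 10.0]   # same proportions, 10x the magnitude

distance = euclidean_distance(light_user, heavy_user)   # large
similarity = cosine_similarity(light_user, heavy_user)  # approximately 1.0
```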
org.jblas.DoubleMatrix

Jaccard Similarity
Set similarity measurement
Set intersection / set union
Based on Jaccard distance
Bias towards popularity
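A short sketch of Jaccard similarity over tag sets. The tags below are invented for illustration and are not taken from the demo dataset.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical tag sets.
top_gun = {"action", "military", "pilots", "1980s"}
taps = {"military", "drama", "1980s"}

similarity = jaccard_similarity(top_gun, taps)   # 2 shared / 5 total
distance = 1.0 - similarity                      # Jaccard distance
```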

Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem

Word Similarity
Edit Distance
Misspellings and autocorrect
Word2Vec
Similar words are defined by similar contexts in vector space
[Figure: Word2Vec vector spaces, English and Spanish]

Demo!
Find Synonyms with Word2Vec

Find Synonyms using Word2Vec

Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines
Doc2Vec
Similar documents are determined by similar contexts
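To make the TF/IDF bullet concrete, here is a small sketch of one common variant: term frequency normalized by document length, multiplied by an unsmoothed log inverse document frequency. Search engines and Spark use their own variants, and the three toy documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus.
docs = [
    "spark makes big data simple".split(),
    "netflix uses spark for recommendations".split(),
    "netflix recommendations use big data".split(),
]
weights = tf_idf(docs)
```

In the first document, "simple" (found in one document) outweighs "spark" (found in two): rarer terms are treated as more descriptive.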

Bonus! Text Rank Document Summary
Text Rank (aka Sentence Rank)
Surface summary sentences
TF/IDF + Similarity Graph + PageRank
Most similar sentence to all other sentences
TF/IDF + Similarity Graph
Most influential sentences
PageRank

Similarity Pathways (Recommendations)
Best recommendations for 2 (or more) people
“You like Mad Max. I like Message in a Bottle.
We might like a movie similar to both.”
Item-to-Item Similarity Graph + Dijkstra Shortest Path
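A rough sketch of the similarity-pathway idea in plain Python. The tag sets, the 0.3 threshold, and the choice of 1 - similarity as the edge weight are all assumptions for illustration; the demo may weight edges differently. Movies are linked when their tag sets are similar enough, and the shortest path between two people's picks surfaces a movie "between" them.

```python
import heapq

def jaccard(a, b):
    return len(a & b) / len(a | b)

def build_graph(movie_tags, threshold):
    """Link movies whose tag-set Jaccard similarity is >= threshold.
    Edge weight is the Jaccard distance (1 - similarity)."""
    graph = {m: {} for m in movie_tags}
    movies = list(movie_tags)
    for i, a in enumerate(movies):
        for b in movies[i + 1:]:
            sim = jaccard(movie_tags[a], movie_tags[b])
            if sim >= threshold:
                graph[a][b] = graph[b][a] = 1.0 - sim
    return graph

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of nodes or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical tags, invented for this example.
movie_tags = {
    "Mad Max": {"action", "post-apocalyptic", "survival"},
    "Waterworld": {"action", "post-apocalyptic", "ocean", "kevin costner"},
    "Message in a Bottle": {"romance", "ocean", "kevin costner"},
}
graph = build_graph(movie_tags, threshold=0.3)
path = shortest_path(graph, "Mad Max", "Message in a Bottle")
```

With these made-up tags, the two picks share nothing directly, so the path runs through the one movie that overlaps with both.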

Demo!
Similarity Pathway for Movie Recommendations

Load Movies with Tags into DataFrame
My Choice
Their Choice

Calculate Tag-based Movie Similarity
Based on Tags
Jaccard Similarity (Based on Tag Sets)
Above Jaccard Similarity Threshold

Create Movie-Tag Similarity Graph
Edge Value Represents Jaccard Similarity (Based on Tag Sets)

Calculate Dijkstra Shortest Pathway

Movies with Tags
My Choice
Their Choice
Our Choice

Calculating Similarity
Exact Brute-Force Similarity
Cartesian Product
O(n^2) shuffle and comparison
aka. All-pairs, Pair-wise, Similarity Join
Approximate Similarity
Sampling
Bucketing or Clustering
Ignore joins of low-similarity probability
Goal: Reduce shuffle

Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights (if provided)

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Recommendations

Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like, rating, movie view, profile read, search
Implicit User Feedback: click, hover, scroll, navigation
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new user)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features

Features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription

Feature Engineering
Dimension Reduction
Reduce number of features in feature space
Principal Component Analysis (PCA)
Find principal features that best describe data variance
Peel dimensional layers back
One-Hot Encoding
Convert nominal categorical feature values into 0’s and 1’s
Remove any numerical relationship between categories
Bears    -> 1          Bears    -> [1.0, 0.0, 0.0]
49’ers   -> 2   -->    49’ers   -> [0.0, 1.0, 0.0]
Steelers -> 3          Steelers -> [0.0, 0.0, 1.0]
Convert Each Item to Binary Vector with Single 1.0 Column
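The encoding above can be sketched in plain Python. The `one_hot_encode` helper is hypothetical and written for this example; it is not a Spark API. Column order follows first appearance of each category.

```python
def one_hot_encode(values):
    """Map each distinct category to a vector with a single 1.0."""
    categories = list(dict.fromkeys(values))   # distinct, first-seen order
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0.0] * len(categories)
        vec[index[v]] = 1.0
        vectors.append(vec)
    return categories, vectors

teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, vectors = one_hot_encode(teams)
```

No vector is "larger" than another, so a downstream model cannot read Steelers (3) as three times Bears (1).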

Feature Normalization & Standardization
Goal
Scale features to standard size
Required by many ML algos
Normalize Features
Calculate L1 (or L2, etc.) norm, then divide each element by it
org.apache.spark.ml.feature.Normalizer
Standardize Features
Apply standard normal transformation
mean == 0, stddev == 1
org.apache.spark.ml.feature.StandardScaler
http://www.mathsisfun.com/data/standard-normal-distribution.htm
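A plain-Python sketch of the two operations, to show the arithmetic only. This is not the Spark `Normalizer` or `StandardScaler` implementation, whose defaults may differ (this sketch assumes the population standard deviation, for example). Normalization works per row vector; standardization works per feature column.

```python
import math

def l2_normalize(vector):
    """Divide each element by the vector's L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    """Shift to mean 0 and scale to standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)  # population std
    return [(x - mean) / std for x in column] if std else [0.0] * n

unit = l2_normalize([3.0, 4.0])               # [0.6, 0.8]
scaled = standardize([3.0, 5.0, 5.0, 7.0])    # mean 0, stddev 1
```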

Non-Personalized Recommendations

Cold Start Problem
“Cold Start” problem
New user, don’t know their preference, must show something!
Movies with highest-rated actors
Top K aggregations
Facebook social graph
Friend-based recommendations
Most desirable singles
PageRank of likes and dislikes

Demo!
GraphFrame PageRank

Dating Site Example: Like Graph

PageRank of Top Influencers

Personalized Recommendations

User-to-User Clustering
User Similarity
Time-based
Pattern of viewing (binge or casual)
Time of viewing (am or pm)
Ratings-based
Content ratings or number of views
Average rating relative to others (critical or lenient)
Search-based
Search terms

Item-to-Item Clustering
Item Similarity
Profile text (TF/IDF, Word2Vec, NLP)
Categories, tags, interests (Jaccard Similarity, LSH)
Images, facial structures (Neural Nets, Eigenfaces)
Dating Site Example: Items == Users!
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
My OKCupid Profile, My Hinge Profile

Bonus: NLP Conversation Starter Bot
“If your responses to my generic opening lines are positive, I may read your profile.”
Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Bonus: Demo!
Spark + Stanford CoreNLP Sentiment Analysis

Bonus: Top 100 Country Song Sentiment

Bonus: Surprising Results…?!

Item-to-Item Based Recommendations
Based on Metadata: Genre, Description, Cast, City

Demo!
Item-to-Item-based Recommendations
One-Hot Encoding + K-Means Clustering

Convert Movie Tags to Feature Vectors

Cluster Using Movie-Tag Feature Vectors

Analyze Movie Tag Clusters

User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions
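The four steps above can be sketched on a toy matrix. The demo slides fit the model with ALS in Spark; this sketch swaps in plain stochastic gradient descent because it fits in a few lines of Python. The ratings, rank, learning rate, and regularization values are arbitrary assumptions.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, steps=2000,
              lr=0.01, reg=0.02, seed=42):
    """Learn user factors U and item factors V from (user, item, rating)."""
    rng = random.Random(seed)
    U = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(U[u][k] * V[i][k] for k in range(rank))
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def predict(U, V, u, i):
    """Dot product of a user's and an item's latent-feature vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def rmse(U, V, ratings):
    total = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Hypothetical sparse 4x4 user-item matrix; absent pairs are unknown.
ratings = [
    (0, 0, 5.0), (0, 1, 3.0), (0, 3, 1.0),
    (1, 0, 4.0), (1, 3, 1.0),
    (2, 0, 1.0), (2, 1, 1.0), (2, 3, 5.0),
    (3, 0, 1.0), (3, 2, 5.0), (3, 3, 4.0),
]
U, V = factorize(ratings, n_users=4, n_items=4)
filled_in = predict(U, V, 1, 1)   # a rating user 1 never gave
```

Multiplying the two small factor matrices yields a value for every cell, including the ones that were missing from the input.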

Item-to-Item Collaborative Filtering
Famous Amazon Paper circa 2003
Problem
As users grew, user-to-item collaborative filtering didn’t scale
Solution
Item-to-item similarity, nearest neighbors
Offline (Batch)
Generate itemId->List[userId] vectors
Online (Real-time)
From cart, recommend nearest-neighbors in vector space
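A rough sketch of the offline and online steps in plain Python. It illustrates the general idea only and is not the algorithm from the Amazon paper: the helper names, the binary user vectors stored as sets, and the purchase data are assumptions.

```python
import math
from collections import defaultdict

def item_user_sets(purchases):
    """Offline step: itemId -> set of userIds who bought it."""
    items = defaultdict(set)
    for user_id, item_id in purchases:
        items[item_id].add(user_id)
    return dict(items)

def cosine(users_a, users_b):
    """Cosine similarity of two binary user vectors stored as sets."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

def recommend(cart, items, k=2):
    """Online step: rank items outside the cart by similarity to the cart."""
    scores = defaultdict(float)
    for in_cart in cart:
        cart_users = items.get(in_cart, set())
        for candidate, users in items.items():
            if candidate not in cart:
                scores[candidate] += cosine(cart_users, users)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:k] if score > 0]

# Hypothetical (userId, itemId) purchase log.
purchases = [(1, "A"), (1, "B"), (2, "A"), (2, "B"),
             (3, "A"), (3, "C"), (4, "D")]
items = item_user_sets(purchases)
recs = recommend(["A"], items)
```

In a real system the item-to-item similarities would be precomputed offline, so the online step is only a lookup.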

Demo!
Collaborative Filtering-based Recommendations

Fitting the Matrix Factorization Model

Show ItemFactors Matrix from ALS

Show UserFactors Matrix from ALS

Generating Individual Recommendations

Generating Batch Recommendations

Clustering + Collaborative Filtering Recs
Cluster matrix output from Matrix Factorization
Latent features derived from user-to-item interactions
Item-to-Item Similarity
Cluster item-factor matrix
User-to-User Similarity
Cluster user-factor matrix

Demo!
Clustering + Collaborative Filtering-based Recommendations

Show ItemFactors Matrix from ALS

Convert to Item Factors -> mllib.Vector
Required by K-Means Clustering Algorithm

Fit and Evaluate K-Means Cluster Model
Measures Closeness of Points Within Clusters
K = 5 Clusters

Netflix Genres and Clusters
Typical Genres
Documentary, Romance, Comedy, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy

Demo!
Personalized PageRank

Personalized PageRank

Personalized PageRank (No Outbound)

① Scaling
② Similarities
③ Recommendations
④ Approximations

When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)
Streaming aggregations
Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later!!

When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office.

A Few Good Algorithms
You can’t handle the approximate!

Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error

Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”

Bloom Filter
Approximate Set.contains(key)
No means No, Yes means Maybe
Elements can only be added
Never updated or removed
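A minimal Bloom filter sketch in plain Python. The bit-array size, the number of hashes, and the trick of salting SHA-256 with a seed to get several hash functions are assumptions for illustration; production libraries such as Algebird choose these from a target error rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per (seeded) hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        """Hash multiple times and flip the bits wherever you land."""
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        """False means definitely absent; True means maybe present."""
        return all(self.bits[pos] for pos in self._positions(key))

seen = BloomFilter()
seen.add("user1001")
seen.add("user2009")
```

Bits are only ever set, never cleared, which is why elements cannot be removed: clearing a bit could erase evidence of a different key that hashed to the same position.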

Bloom Filter in Action
set(key) contains(key): Boolean
Images by @avibryant
Set.contains(key): TRUE -> maybe contains
Set.contains(key): FALSE -> definitely does not contain.

CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”

CountMin Sketch (CMS)
Approximate frequency count and TopK for key
i.e. “Heavy Hitters” on Twitter
Matei Zaharia Martin Odersky Donald Trump

CountMin Sketch In Action (TopK, Count)
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
x 2 occurrences of “Top Gun” for slightly additional complexity
Overlap Top Gun, Overlap A Few Good Men
Find minimum of all rows
Can overestimate, but never underestimate
Images derived from @avibryant
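The same add and getCount calls can be sketched in plain Python. The depth, width, and seeded SHA-256 hashing are illustrative assumptions, not the layout used by Algebird or Spark.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        """Hash once per row and add the count wherever you land."""
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        """Minimum across rows: collisions can inflate, never deflate."""
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_views = views.get_count("Top Gun")
```

Taking the minimum picks the row with the fewest collisions, which is why the estimate is an upper bound on the true count.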

HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”

HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Error estimate
14 bits for size of range
m = 2^14 = 16,384 hash slots
error = 1.04 / sqrt(16,384) = 0.81%
Not many of these
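A compact HyperLogLog sketch in plain Python, to connect the 14-bit register index to the error figure above. This is a simplified teaching version and an assumption throughout: it uses SHA-1 truncated to 64 bits as the hash and only the small-range correction, unlike the tuned implementations in Redis, Algebird, or Spark.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                   # bits used to pick a register
        self.m = 1 << p              # 2^14 = 16,384 registers by default
        self.registers = [0] * self.m

    def add(self, key):
        digest = hashlib.sha1(str(key).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash
        index = h >> (64 - self.p)                      # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1    # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise max (composable)."""
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range fix
        return estimate

    def standard_error(self):
        return 1.04 / math.sqrt(self.m)

viewers = HyperLogLog()
for user in ["user1001", "user2009", "user3005", "user3003", "user1001"]:
    viewers.add(user)
approx_distinct = viewers.count()   # close to 4; the duplicate is ignored
```

Because `merge` is an element-wise max, sketches built on different workers or for different hours can be combined in any order.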

HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
[Figure: three panels, “Top Gun: Hour 1”, “Top Gun: Hour 2”, and “Top Gun: Hour 1 + 2”, each showing user IDs spread along a hash range (axes labeled 0 to 16 and 0 to 32)]
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales

Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”

Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets
b << m
Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel !!
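One common way to realize these steps for set similarity is MinHash signatures with banding. The slides do not say which LSH family the demo uses, so treat this as an assumed example: the number of bands and rows, the seeded MD5 hashing, and the tag sets are illustrative choices, and every input set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def min_hash_signature(items, num_hashes):
    """One minimum per seeded hash function; similar sets share minima."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16)
            for item in items)
        for seed in range(num_hashes)
    ]

def candidate_pairs(sets_by_id, bands=4, rows=2):
    """Hash each band of the signature to a bucket; pair up bucket-mates."""
    buckets = defaultdict(set)
    for set_id, items in sets_by_id.items():
        signature = min_hash_signature(items, bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(set_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical tag sets keyed by movie id.
tags = {
    "m1": {"action", "military", "pilots"},
    "m2": {"action", "military", "pilots"},   # identical to m1
    "m3": {"romance", "ocean", "letters"},    # shares nothing with m1
}
pairs = candidate_pairs(tags)
```

Only the pairs that land in a shared bucket go on to an exact similarity check, so the all-pairs comparison is avoided.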

DIMSUM
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”

DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
40% efficiency gain over brute-force Cosine Sim

Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing

Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)

Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)
Add user views for given movie
PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001
ignore duplicates (user1001 was already added)
Get distinct count (cardinality) of set
PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL

Spark Approximations
Spark Core
RDD.count*Approx()
Spark SQL
PartialResult
approxCountDistinct(column)
HyperLogLogPlus
Spark ML
Stratified sampling
PairRDD.sampleByKey(withReplacement: Boolean, fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of shuffle
RowMatrix.columnSimilarities(threshold)

Demo!
Exact Count vs. Approximate HLL and CMS Count

HashSet vs. HyperLogLog (Memory)

HashSet vs. CountMin Sketch (Memory)

Demo!
Exact Similarity vs. Approximate LSH Similarity

Brute Force Cartesian All Pair Similarity
47 seconds

Locality Sensitive Hash All Pair Similarity
6 seconds

Many More Demos!
or
Download Docker Clone on Github

① Scaling
② Similarities
③ Recommendations
④ Approximations

Netflix Recommendations
From Ratings to Real-time

Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.
My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle.
Renamed my favourite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!
Summary: Buy NFLX Stock!

Netflix Data Pipeline - Then
v1.0
v2.0

Netflix Data Pipeline – Now (Keystone)
v3.0
9 million events per second
22 GB per second!!
EC2 D2XL: Disk 6 TB, 475 MB/s; RAM 30 G; Network 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA: Splits high and normal priority

Netflix Recommendation Data Pipeline
Throw away batch-generated user factors (U)
Keep video factors (V)

Netflix Trending Now (Time-based Recs)
Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)
“VHS”
Number of Plays
Number of Impressions
Calculate Take Rate

Bonus: Pandora Time-based Recs
Work Days
Play familiar music
User is less likely to accept new music
Evenings and Weekends
Play new music
User is more likely to accept new music

$1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (Root Mean Sq Error)
Test data withheld to calculate RMSE upon submission
5-star Ratings Dataset
(userId, movieId, rating, timestamp)
Winning algorithm(s)
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
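For reference, a small sketch of how RMSE and a relative improvement are computed. The ratings and predictions are made-up numbers, not Netflix Prize data, and `improvement_pct` is a hypothetical helper name.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predictions and true ratings."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def improvement_pct(baseline_rmse, new_rmse):
    """Relative RMSE reduction versus a baseline, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical held-out ratings and two sets of predictions.
actual = [4, 3, 5, 1]
baseline = [3, 3, 3, 3]   # always predict the middle rating
model = [4, 3, 4, 2]

baseline_error = rmse(baseline, actual)   # 1.5
model_error = rmse(model, actual)         # about 0.707
gain = improvement_pct(baseline_error, model_error)
```

Lower RMSE is better, so the contest's 10% goal meant a 10% reduction relative to the baseline system's RMSE on the withheld test data.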

Secrets to the Winning Algorithms
Adjust for the following human biases…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather

Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles!

Netflix Genres and Clusters
Typical Genres
Documentaries, Romance Comedies, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy

Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem

Netflix Search
No results? No problem… Show similar results!
Utilize extensive DVD Catalog
Metadata search (ElasticSearch)
Named entity recognition (NLP)
Empty searches are opportunity!
Explicit feedback for future recommendations
Content to buy and produce!

Higher Ratings in 2004?
In 2004, Netflix noticed higher ratings on average
Some possible reasons why…
① Significant UI improvements deployed
② New recommendation engine deployed
③

Thank You, Everyone!!
Chris Fregly @cfregly
IBM Spark Tech Center
San Francisco, California, USA
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
