Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc
advancedspark.com
Who Am I?
Streaming Data Engineer
Netflix OSS Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced .
Due 2016
Recent World Tour: Freg-a-Palooza!
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Istanbul Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 28th)
Singapore Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
Toronto Spark Meetup (Dec 14th)
Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 5 Most-active Spark Meetup!
2600+ Members in just 6 mos!!
2600+ Docker downloads (demos)
Meetup Mission
Code deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance
Live, Interactive Demo!
Audience Participation Required!!
Cell Phone Compatible!!!
http://demo.advancedspark.com
http://demo.advancedspark.com
[Architecture diagram. Left-hand labels: End User, ElasticSearch, Spark ML, Data Scientist. Right-hand labels: Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython]
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Scaling with Parallelism
[Diagram labels: Peter, O(log n), O(log n), Worker Nodes]
Parallelism with Composability
Worker 1 Worker 2
Max (a max b max c max d) == (a max b) max (c max d)
Set Union (a U b U c U d) == (a U b) U (c U d)
Addition (a + b + c + d) == (a + b) + (c + d)
Multiply (a * b * c * d) == (a * b) * (c * d)
What about Division and Average?
Collect at Driver
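A minimal sketch of the idea on this slide, in plain Python rather than Spark: each "worker" reduces its own partition, and the partial results are combined at the end. The helper name `reduce_in_partitions` and the sample values are assumptions made for illustration. Associative operations agree with a sequential reduce; division does not.

```python
from functools import reduce
import operator

def reduce_in_partitions(op, values, num_partitions=2):
    """Reduce each partition separately, then combine the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(op, part) for part in partitions]  # one per "worker"
    return reduce(op, partials)  # final combine step

values = [3, 4, 7, 8]

# Associative operations give the same answer either way.
for op in (max, operator.add, operator.mul):
    assert reduce_in_partitions(op, values) == reduce(op, values)

sets = [{1}, {2}, {1, 3}, {4}]
assert reduce_in_partitions(operator.or_, sets) == reduce(operator.or_, sets)

# Division is not associative, so the partitioned result differs.
sequential = reduce(operator.truediv, values)                  # ((3 / 4) / 7) / 8
partitioned = reduce_in_partitions(operator.truediv, values)   # (3 / 4) / (7 / 8)
assert sequential != partitioned
```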
What about Division?
Division (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
What were the Egyptians thinking?!
Not Composable
“Divide like an Egyptian”
What about Average?
Overall AVG (values, counts)
(3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
Add, Add, Add? Composable!
Single-Node Divide at the End? Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
Pairwise AVG
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide? Not Composable
(Averaging the per-pair averages again only matches the overall average when every partition has the same count.)
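The (value, count) trick can be sketched as follows. This is an illustration in plain Python, not code from the talk; the function names and the uneven split across two workers are assumptions chosen to show where an average of averages goes wrong.

```python
from functools import reduce

def to_pair(value):
    """Lift a value into a (sum, count) pair."""
    return (value, 1)

def combine(a, b):
    """Associative merge of two (sum, count) pairs: add, add."""
    return (a[0] + b[0], a[1] + b[1])

def finish(pair):
    """The single divide, done once at the end."""
    total, count = pair
    return total / count

# Deliberately uneven partitions.
worker_1 = [3, 5, 5]
worker_2 = [7]

partial_1 = reduce(combine, map(to_pair, worker_1))  # (13, 3)
partial_2 = reduce(combine, map(to_pair, worker_2))  # (7, 1)
overall = finish(combine(partial_1, partial_2))      # 20 / 4

# Dividing early and averaging the averages gives a different answer.
average_of_averages = (finish(partial_1) + finish(partial_2)) / 2
```

Only the final `finish` step divides, so everything before it can run on any number of workers in any grouping.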
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Similarities
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude
Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
org.jblas.DoubleMatrix
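To make the magnitude bias concrete, here is a small sketch in plain Python (the slide itself points to `org.jblas.DoubleMatrix`; this is not that API). The two rating vectors are made-up examples: they point in the same direction but differ tenfold in size.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

light_rater = [1.0, 2.0, 1.0]
heavy_rater = [10.0, 20.0, 10.0]  # same taste, 10x the magnitude

distance = euclidean_distance(light_rater, heavy_rater)   # large
similarity = cosine_similarity(light_rater, heavy_rater)  # close to 1.0
```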
Jaccard Similarity
Set similarity measurement
Set intersection / set union
Based on Jaccard distance
Bias towards popularity
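A minimal sketch of Jaccard similarity over tag sets. The movie tags below are hypothetical, and returning 0.0 for two empty sets is an assumption of this sketch rather than part of the definition.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets count as dissimilar
    return len(a & b) / len(a | b)

# Hypothetical tag sets.
top_gun = {"action", "military", "aviation", "1980s"}
taps = {"military", "drama", "1980s"}

similarity = jaccard_similarity(top_gun, taps)  # 2 shared tags out of 5 total
distance = 1.0 - similarity                     # Jaccard distance
```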
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
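One common way to compute a log-likelihood score is the log-likelihood ratio over a 2x2 co-occurrence table, sketched below in its entropy form. The slide does not say which variant was used, so treat this as an assumption; the counts are invented to mimic a movie that almost everyone has watched.

```python
import math

def _x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    """Unnormalized entropy: N*log(N) minus the sum of k*log(k)."""
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def log_likelihood_ratio(both, only_a, only_b, neither):
    """Score how surprising the co-occurrence of items A and B is."""
    row = _entropy(both + only_a, only_b + neither)
    col = _entropy(both + only_b, only_a + neither)
    matrix = _entropy(both, only_a, only_b, neither)
    return max(0.0, 2.0 * (row + col - matrix))

# Hypothetical counts: B is a hit watched by 900 of 1000 users.
# A's viewers watch B at the same 90% rate as everyone else.
popular = log_likelihood_ratio(both=90, only_a=10, only_b=810, neither=90)

# Hypothetical counts: A and B are niche and almost always co-watched.
niche = log_likelihood_ratio(both=100, only_a=1, only_b=1, neither=100)
```

The popular pair shares 90 viewers yet scores about zero, because that overlap is exactly what popularity alone predicts; the niche pair scores high.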
Word Similarity
Edit Distance
Misspellings and autocorrect
Word2Vec
Similar words are defined by similar contexts in vector space
[Figure panels: English, Spanish]
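Edit distance for misspellings can be sketched with the classic Levenshtein dynamic program. The `autocorrect` helper and its tiny vocabulary are hypothetical, added only to show how the distance might be used.

```python
def edit_distance(source, target):
    """Levenshtein distance, computed one row of the table at a time."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]

def autocorrect(word, vocabulary):
    """Pick the vocabulary entry with the smallest edit distance."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

suggestion = autocorrect("sprak", ["spark", "kafka", "redis"])
```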
Demo!
Find Synonyms with Word2Vec
Find Synonyms using Word2Vec
Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines
Doc2Vec
Similar documents are determined by similar contexts
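A bare-bones TF/IDF sketch, using the unsmoothed textbook weighting (term frequency times log of documents over document frequency). Real libraries usually apply some smoothing, so the exact numbers here are an assumption of this sketch; the three toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    doc_count = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(doc_count / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights

docs = [
    "spark streaming with kafka".split(),
    "spark ml recommendations".split(),
    "spark graph pagerank".split(),
]
weights = tf_idf(docs)
```

A term that appears in every document ("spark") gets weight zero, while a term unique to one document ("kafka") stands out.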
Bonus! Text Rank Document Summary
Text Rank (aka Sentence Rank)
Surface summary sentences
TF/IDF + Similarity Graph + PageRank
Most similar sentence to all other sentences
TF/IDF + Similarity Graph
Most influential sentences
PageRank
Similarity Pathways (Recommendations)
Best recommendations for 2 (or more) people
“You like Mad Max. I like Message in a Bottle.
We might like a movie similar to both.”
Item-to-Item Similarity Graph + Dijkstra Shortest Path
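A sketch of the similarity-pathway idea in plain Python. The similarity scores and the intermediate movies are hypothetical, and converting similarity to edge cost as `1 - similarity` is one possible choice assumed here, not necessarily what the demo does.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over {node: {neighbor: cost}}; returns (cost, path) or None."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return None

def to_costs(similarities):
    """Undirected similarity edges become costs: more similar = cheaper."""
    graph = {}
    for (a, b), sim in similarities.items():
        graph.setdefault(a, {})[b] = 1.0 - sim
        graph.setdefault(b, {})[a] = 1.0 - sim
    return graph

similarities = {  # hypothetical item-to-item similarity scores
    ("Mad Max", "The Road Warrior"): 0.9,
    ("The Road Warrior", "Waterworld"): 0.6,
    ("Waterworld", "Message in a Bottle"): 0.5,
}
cost, path = shortest_path(to_costs(similarities), "Mad Max", "Message in a Bottle")
shared_candidates = path[1:-1]  # movies "between" the two starting choices
```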
Demo!
Similarity Pathway for Movie Recommendations
Load Movies with Tags into DataFrame
[Screenshot annotations: My Choice, Their Choice]
Calculate Tag-based Movie Similarity
Based on Tags
Jaccard Similarity (Based on Tag Sets)
Above Jaccard Similarity Threshold
Create Movie-Tag Similarity Graph
Edge Value Represents Jaccard Similarity (Based on Tag Sets)
Calculate Dijkstra Shortest Pathway
Movies with Tags
[Screenshot annotations: My Choice, Their Choice, Our Choice]
Calculating Similarity
Exact Brute-Force Similarity
Cartesian Product
O(n^2) shuffle and comparison
aka. All-pairs, Pair-wise, Similarity Join
Approximate Similarity
Sampling
Bucketing or Clustering
Ignore joins of low-similarity probability
Goal: Reduce shuffle
Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights (if provided)
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Recommendations
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like, rating, movie view, profile read, search
Implicit User Feedback: click, hover, scroll, navigation
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
Features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription
Feature Engineering
Dimension Reduction
Reduce number of features in feature space
Principal Component Analysis (PCA)
Find principal features that best describe data variance
Peel dimensional layers back
One-Hot Encoding
Convert nominal categorical feature values into 0’s and 1’s
Remove any numerical relationship between categories
Bears    -> 1  -->  Bears    -> [1.0, 0.0, 0.0]
49’ers   -> 2  -->  49’ers   -> [0.0, 1.0, 0.0]
Steelers -> 3  -->  Steelers -> [0.0, 0.0, 1.0]
Convert Each Item to Binary Vector with Single 1.0 Column
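The team example can be reproduced with a few lines of plain Python. The `one_hot_encoder` helper is a hypothetical name for this sketch, not a Spark API.

```python
def one_hot_encoder(categories):
    """Build an encoder mapping each category to a vector with a single 1.0."""
    index = {category: i for i, category in enumerate(categories)}

    def encode(value):
        vector = [0.0] * len(index)
        vector[index[value]] = 1.0
        return vector

    return encode

encode = one_hot_encoder(["Bears", "49'ers", "Steelers"])
bears = encode("Bears")
steelers = encode("Steelers")
```

Because every pair of encoded teams is equally far apart, no team looks numerically "larger" or "closer" than another, unlike the integer codes 1, 2, 3.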
Feature Normalization & Standardization
Goal
Scale features to standard size
Required by many ML algos
Normalize Features
Calculate L1 (or L2, etc.) norm, then divide each element by it
org.apache.spark.ml.feature.Normalizer
Standardize Features
Apply standard normal transformation
mean == 0, stddev == 1
org.apache.spark.ml.feature.StandardScaler
http://www.mathsisfun.com/data/standard-normal-distribution.htm
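The two transformations can be sketched in plain Python as below. This is not the Spark `Normalizer` or `StandardScaler` implementation; in particular, using the population standard deviation is an assumption of this sketch, and a library may use the sample estimate instead.

```python
import math

def normalize(vector, p=2):
    """Divide each element of one row vector by the vector's Lp norm."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector]

def standardize(column):
    """Shift one feature column to mean 0 and scale it to stddev 1."""
    mean = sum(column) / len(column)
    stddev = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / stddev for x in column]

unit_l2 = normalize([3.0, 4.0])        # L2 norm is 5
unit_l1 = normalize([3.0, 4.0], p=1)   # L1 norm is 7
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, stddev 2
```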
Non-Personalized Recommendations
Cold Start Problem
“Cold Start” problem
New user, don’t know their preference, must show something!
Movies with highest-rated actors
Top K aggregations
Facebook social graph
Friend-based recommendations
Most desirable singles
PageRank of likes and dislikes
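As a rough illustration of ranking by likes, here is a plain-Python PageRank using power iteration. It handles likes only (not dislikes), it is not the GraphFrames implementation used in the demo, and the names and edges are invented.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outbound likes is shared by everyone.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

# Hypothetical "like" graph: (who likes, who is liked).
likes = [("alice", "carol"), ("bob", "carol"), ("dave", "carol"), ("carol", "alice")]
ranks = pagerank(likes)
most_desirable = max(ranks, key=ranks.get)
```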
Demo!
GraphFrame PageRank
Dating Site Example: Like Graph
PageRank of Top Influencers
Personalized Recommendations
User-to-User Clustering
User Similarity
Time-based
Pattern of viewing (binge or casual)
Time of viewing (am or pm)
Ratings-based
Content ratings or number of views
Average rating relative to others (critical or lenient)
Search-based
Search terms
Item-to-Item Clustering
Item Similarity
Profile text (TF/IDF, Word2Vec, NLP)
Categories, tags, interests (Jaccard Similarity, LSH)
Images, facial structures (Neural Nets, Eigenfaces)
Dating Site Example: Items == Users!
My OKCupid Profile, My Hinge Profile
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Bonus: NLP Conversation Starter Bot
“If your responses to my generic opening lines are positive, I may read your profile.”
Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Bonus: Demo!
Spark + Stanford CoreNLP Sentiment Analysis
Bonus: Top 100 Country Song Sentiment
Bonus: Surprising Results…?!
Item-to-Item Based Recommendations
Based on Metadata: Genre, Description, Cast, City
Demo!
Item-to-Item-based Recommendations
One-Hot Encoding + K-Means Clustering
Convert Movie Tags to Feature Vectors
Cluster Using Movie-Tag Feature Vectors
Analyze Movie Tag Clusters
User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions
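Steps ② and ③ can be illustrated without fitting anything: given two small factor matrices, multiplying them back produces a score for every user-item cell, including the ones nobody rated. The rank-2 factors below are hypothetical, written by hand as if a fitting step (such as ALS) had already produced them.

```python
def predict(user_vector, item_vector):
    """Predicted rating is the dot product of two latent-factor vectors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))

# Hypothetical rank-2 factors (k = 2 latent features).
user_factors = {"ann": [1.0, 0.0], "bob": [0.0, 1.0], "cat": [0.75, 0.25]}
item_factors = {"Top Gun": [5.0, 1.0], "Taps": [4.0, 2.0], "Amelie": [1.0, 5.0]}

# Multiplying the factors fills in every cell of the large matrix.
predicted = {
    (user, item): predict(u_vec, i_vec)
    for user, u_vec in user_factors.items()
    for item, i_vec in item_factors.items()
}

def recommend(user, already_seen):
    """Highest predicted item the user has not interacted with yet."""
    unseen = [item for item in item_factors if item not in already_seen]
    return max(unseen, key=lambda item: predicted[(user, item)])
```

Three users and three items need nine cells in the full matrix but only twelve factor values here; the saving grows quickly as the matrix gets larger and k stays small.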
Item-to-Item Collaborative Filtering
Famous Amazon Paper circa 2003
Problem
As users grew, user-to-item collaborative filtering didn’t scale
Solution
Item-to-item similarity, nearest neighbors
Offline (Batch)
Generate itemId->List[userId] vectors
Online (Real-time)
From cart, recommend nearest-neighbors in vector space
Demo!
Collaborative Filtering-based Recommendations
Fitting the Matrix Factorization Model
Show ItemFactors Matrix from ALS
Show UserFactors Matrix from ALS
Generating Individual Recommendations
Generating Batch Recommendations
Clustering + Collaborative Filtering Recs
Cluster matrix output from Matrix Factorization
Latent features derived from user-to-item interactions
Item-to-Item Similarity
Cluster item-factor matrix->
User-to-User Similarity
<-Cluster user-factor matrix
Demo!
Clustering + Collaborative Filtering-based Recommendations
Show ItemFactors Matrix from ALS
Convert to Item Factors -> mllib.Vector
Required by K-Means Clustering Algorithm
Fit and Evaluate K-Means Cluster Model
Measures Closeness of Points Within Clusters
K = 5 Clusters
Netflix Genres and Clusters
Typical Genres
Documentary, Romance, Comedy, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy
Demo!
Personalized PageRank
Personalized PageRank
Personalized PageRank (No Outbound)
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)
Streaming aggregations
Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later !!
When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office.
A Few Good Algorithms
You can’t handle the approximate!
Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error
Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
Bloom Filter
Approximate Set.contains(key)
No means No, Yes means Maybe
Elements can only be added
Never updated or removed
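A minimal Bloom filter sketch in plain Python. The bit-array size, the number of hashes, and the trick of salting SHA-256 with an index to get several hash functions are all assumptions made for illustration, not tuned choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several hash functions by salting one digest with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        """Hash multiple times and flip the bits wherever you land."""
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        """False means definitely absent; True means maybe present."""
        return all(self.bits[position] for position in self._positions(key))

viewers = BloomFilter()
for user in ("user1001", "user2009", "user3005"):
    viewers.add(user)
```

There is no `remove` method: clearing a bit could also erase evidence of a different key that happened to hash to the same position.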
Bloom Filter in Action
set(key) contains(key): Boolean
Images by @avibryant
Set.contains(key): TRUE -> maybe contains
Set.contains(key): FALSE -> definitely does not contain.
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
CountMin Sketch (CMS)
Approximate frequency count and TopK for key
e.g. “Heavy Hitters” on Twitter
Matei Zaharia, Martin Odersky, Donald Trump
CountMin Sketch In Action (TopK, Count)
Images derived from @avibryant
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
2 occurrences of “Top Gun” for slightly more complexity
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch grid with cells labeled Top Gun (x 2), A Few Good Men, Taps; annotations “Overlap Top Gun” and “Overlap A Few Good Men”]
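The add and getCount calls on this slide can be sketched as follows. The depth and width defaults and the salted SHA-256 hashing are assumptions of this sketch, not values from the talk.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        """Hash once per row and add the count wherever you land."""
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        """Minimum across rows: collisions can inflate, never deflate."""
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_views = views.get_count("Top Gun")
```

Taking the minimum picks the row where the key suffered the fewest collisions, which is why the estimate is an upper bound on the true count.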
HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Error estimate
14 bits for size of range
m = 2^14 = 16,384 hash slots
error = 1.04 / sqrt(16,384) = 0.81%
Not many of these
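The error estimate above is plain arithmetic, sketched here so other precisions can be tried. The function name is hypothetical; the 1.04 / sqrt(m) formula is the one given on the slide.

```python
import math

def hll_standard_error(precision_bits):
    """Relative standard error for m = 2**precision_bits hash slots."""
    slots = 2 ** precision_bits
    return 1.04 / math.sqrt(slots)

error_14_bits = hll_standard_error(14)  # m = 16,384 slots
for bits in (10, 12, 14, 16):
    print(f"{bits} bits -> {2 ** bits} slots -> {hll_standard_error(bits):.2%} error")
```

Each two extra bits quadruple the number of slots and halve the error, which is the size-versus-error tradeoff in practice.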
HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
[Figure: hashed user IDs (user 1001, user 2009, user 3001, ...) placed along uniform scales in three panels: Top Gun: Hour 1, Top Gun: Hour 2, Top Gun: Hour 1 + 2. Scale ranges shown: 0 to 16 and 0 to 32]
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales
Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets
b << m
Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel !!
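One way to realize the bucketing described above is MinHash signatures split into bands, sketched below. The slide does not name a specific LSH family, so MinHash with banding is an assumption, as are the band and row counts and the example tag sets. Input sets must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def min_hash_signature(items, num_hashes=20):
    """One minimum per salted hash function; similar sets share many minima."""
    return [min(int(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16)
                for item in items)
            for seed in range(num_hashes)]

def candidate_pairs(named_sets, bands=10, rows=2):
    """Bucket each set once per band; only bucket-mates become candidates."""
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        signature = min_hash_signature(items, bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

movie_tags = {  # hypothetical tag sets
    "Top Gun": {"action", "military", "aviation"},
    "Top Gun (re-release)": {"action", "military", "aviation"},
    "Amelie": {"romance", "comedy", "paris"},
}
pairs = candidate_pairs(movie_tags)
```

Exact similarity (for example Jaccard) is then computed only for the candidate pairs, and each bucket can be processed independently in parallel.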
DIMSUM
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”
DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
40% efficiency gain over brute-force Cosine Sim
Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)
Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)
Add user views for given movie
PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001
Get distinct count (cardinality) of set
PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie)
Duplicates are ignored (user1001 is counted once)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL
Spark Approximations
Spark Core
RDD.count*Approx()
Spark SQL
PartialResult
approxCountDistinct(column)
HyperLogLogPlus
Spark ML
Stratified sampling
PairRDD.sampleByKey(withReplacement, fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of shuffle
RowMatrix.columnSimilarities(threshold)
Demo!
Exact Count vs. Approximate HLL and CMS Count
HashSet vs. HyperLogLog (Memory)
HashSet vs. CountMin Sketch (Memory)
Demo!
Exact Similarity vs. Approximate LSH Similarity
Brute Force Cartesian All Pair Similarity
47 seconds
Locality Sensitive Hash All Pair Similarity
6 seconds
Many More Demos!
or
Download Docker Clone on Github
http://advancedspark.com
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations
Netflix Recommendations
From Ratings to Real-time
Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.
My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle.
Renamed my favourite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!
Summary: Buy NFLX Stock!
Netflix Data Pipeline - Then
v1.0
v2.0
Netflix Data Pipeline – Now (Keystone)
v3.0
9 million events per second
22 GB per second!!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling, fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority
Netflix Recommendation Data Pipeline
Throw away batch-generated user factors (U)
Keep video factors (V)
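One plausible reading of "throw away U, keep V" is that a user's vector is recomputed on demand from that user's latest ratings while the video factors stay fixed. The sketch below shows that idea as a regularized least-squares "fold-in" in plain Python. This is an assumption about the technique, not Netflix's documented implementation. The function names, the tiny 2-factor videos, and `reg` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in_user(video_factors, ratings, reg=0.1):
    """User vector u minimizing sum((r - u.v)^2) + reg * |u|^2 with V held fixed.

    ratings: {video_id: rating} for a single user.
    """
    k = len(next(iter(video_factors.values())))
    a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for video_id, rating in ratings.items():
        v = video_factors[video_id]
        for i in range(k):
            b[i] += rating * v[i]
            for j in range(k):
                a[i][j] += v[i] * v[j]
    return solve(a, b)


def predict(user_vec, video_vec):
    return sum(u * v for u, v in zip(user_vec, video_vec))


# Hypothetical batch-trained video factors (V), two latent factors each
video_factors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

# A user's most recent ratings; reg=0.0 here so the toy numbers work out exactly
user_vec = fold_in_user(video_factors, {"a": 4.0, "b": 2.0}, reg=0.0)
print("user factors:", user_vec)
print("predicted score for c:", predict(user_vec, video_factors["c"]))
```

Because only a small k x k system is solved per user, this step is cheap enough to run at request time, while the expensive factorization that produces V stays in batch.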
Netflix Trending Now (Time-based Recs)
Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)
“VHS”
Number of Plays
Number of Impressions
Calculate Take Rate
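Take rate here is plays divided by impressions per video. A minimal sketch over a small in-memory batch, assuming events arrive as `(video_id, kind)` tuples with `kind` being `"play"` or `"impression"`. That event shape is hypothetical. The slide says the production job uses Spark Streaming, and this plain-Python version only illustrates the arithmetic.

```python
from collections import Counter


def take_rates(events):
    """Return {video_id: plays / impressions} for videos with impressions."""
    plays, impressions = Counter(), Counter()
    for video_id, kind in events:
        if kind == "play":
            plays[video_id] += 1
        elif kind == "impression":
            impressions[video_id] += 1
    return {video: plays[video] / shown for video, shown in impressions.items()}


# Hypothetical window of events
events = (
    [("vhs", "impression")] * 4
    + [("vhs", "play")]
    + [("other", "impression")] * 2
)

rates = take_rates(events)
print(rates)
```

In a streaming job the same two counters would be maintained per time window, so the ranking can react to short-lived events.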
Bonus: Pandora Time-based Recs
Work Days
Play familiar music
User is less likely to accept new music
Evenings and Weekends
Play new music
User is more likely to accept new music
$1 Million Netflix Prize (2006-2009)
Goal
Improve movie rating predictions by 10% (Root Mean Squared Error, RMSE)
Test data withheld to calculate RMSE upon submission
5-star Ratings Dataset
(userId, movieId, rating, timestamp)
Winning algorithm(s)
10.06% improvement (RMSE)
Ensemble of 500+ ML models combined with GBDTs
Computationally impractical
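For reference, a short sketch of how RMSE and the relative improvement are computed. The three sample ratings are made up. The baseline and winning RMSE values (0.9525 and 0.8567) are the commonly cited Netflix Prize test-set figures and are not stated on the slide, so treat them as an outside assumption.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement_pct(baseline_rmse, new_rmse):
    """Relative RMSE reduction, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Made-up predictions vs. withheld 5-star ratings
print(rmse([3, 4, 5], [4, 4, 3]))

# Commonly cited figures (assumption): baseline 0.9525, winning entry 0.8567
print(round(improvement_pct(0.9525, 0.8567), 2))
```

Because the errors are squared before averaging, a few badly wrong predictions cost more than many slightly wrong ones.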
Secrets to the Winning Algorithms
Adjust for the following human biases…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather
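The first two effects are often modeled as a baseline predictor: global mean, plus a per-user offset, plus a per-movie offset. The sketch below is a simple unregularized version with hypothetical users and movies. It illustrates the idea and is not the prize-winning code.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user_id, movie_id, rating).

    Returns (global_mean, user_bias, movie_bias).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Movie bias: how far a movie's ratings sit from the global mean
    by_movie = defaultdict(list)
    for _, movie, r in ratings:
        by_movie[movie].append(r - mu)
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}

    # User bias: what is left after removing the mean and the movie bias
    by_user = defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - mu - movie_bias[movie])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}

    return mu, user_bias, movie_bias


def predict_baseline(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies fall back to a zero offset
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)


# Hypothetical ratings: alice rates low, "inception" is rated high
ratings = [
    ("alice", "inception", 4), ("alice", "other", 2),
    ("bob", "inception", 5), ("bob", "other", 3),
]

mu, user_bias, movie_bias = fit_baseline(ratings)
print(mu, user_bias, movie_bias)
print(predict_baseline(mu, user_bias, movie_bias, "alice", "inception"))
```

A latent-factor or ensemble model can then be trained on the residuals, so it does not have to relearn these offsets.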
Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles!
Netflix Genres and Clusters
Typical Genres
Documentaries, Romantic Comedies, Horror, Action, Adventure
Latent (Hidden) Clusters
Emotionally-Independent Dramas for Hopeless Romantics
Witty Dysfunctional-Family TV Animated Comedies
Romantic Crime Movies based on Classic Literature
Latin American Forbidden-Love Movies
Critically-acclaimed Emotional Drug Movie
Cerebral Military Movie based on Real Life
Sentimental Movies about Horses for Ages 11-12
Gory Canadian Revenge Movies
Raunchy Mad Scientist Comedy
Netflix Social Integration
Post to Facebook 5 mins after the movie starts
Recommend to new users based on friends
Helps with Cold Start problem
Netflix Search
No results? No problem… Show similar results!
Utilize extensive DVD Catalog
Metadata search (ElasticSearch)
Named entity recognition (NLP)
Empty searches are an opportunity!
Explicit feedback for future recommendations
Content to buy and produce!
Higher Ratings in 2004?
In 2004, Netflix noticed higher ratings on average
Some possible reasons why…
① Significant UI improvements deployed
② New recommendation engine deployed
③
Thank You, Everyone!!
Chris Fregly @cfregly
IBM Spark Tech Center
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
http://advancedspark.com
@cfregly

%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 

DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark, spark.tc, advancedspark.com
  • 2. Who Am I? Streaming Data Engineer, Netflix OSS Committer. Data Solutions Engineer, Apache Contributor. Principal Data Solutions Engineer, IBM Technology Center. Meetup Organizer, Advanced Apache Spark Meetup. Book Author, Advanced . Due 2016
  • 3. Recent World Tour: Freg-a-Palooza! London Spark Meetup (Oct 12th), Scotland Data Science Meetup (Oct 13th), Dublin Spark Meetup (Oct 15th), Barcelona Spark Meetup (Oct 20th), Madrid Big Data Meetup (Oct 22nd), Paris Spark Meetup (Oct 26th), Amsterdam Spark Summit (Oct 27th), Brussels Spark Meetup (Oct 30th), Zurich Big Data Meetup (Nov 2nd), Geneva Spark Meetup (Nov 5th), Oslo Big Data Hadoop Meetup (Nov 19th), Helsinki Spark Meetup (Nov 20th), Stockholm Spark Meetup (Nov 23rd), Copenhagen Spark Meetup (Nov 25th), Istanbul Spark Meetup (Nov 26th), Budapest Spark Meetup (Nov 28th), Singapore Spark Meetup (Dec 1st), Sydney Spark Meetup (Dec 8th), Melbourne Spark Meetup (Dec 9th), Toronto Spark Meetup (Dec 14th)
  • 4. Advanced Apache Spark Meetup (http://advancedspark.com). Meetup Metrics: Top 5 most-active Spark Meetup! 2600+ members in just 6 months!! 2600+ Docker downloads (demos). Meetup Mission: code deep-dive into Spark and related open source projects; surface key patterns and idioms; focus on distributed systems, scale, and performance.
  • 5. Live, Interactive Demo! Audience Participation Required!! Cell Phone Compatible!!! http://demo.advancedspark.com
  • 6. Demo architecture (http://demo.advancedspark.com): End User -> ElasticSearch -> Spark ML -> Data Scientist -> ... <- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython
  • 7. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 8. Scaling with Parallelism (figure: Peter; O(log n); Worker Nodes)
  • 9. Parallelism with Composability (figure labels: Worker 1, Worker 2, Collect at Driver)
      Max: (a max b max c max d) == (a max b) max (c max d)
      Set Union: (a U b U c U d) == (a U b) U (c U d)
      Addition: (a + b + c + d) == (a + b) + (c + d)
      Multiply: (a * b * c * d) == (a * b) * (c * d)
      What about Division and Average?
  • 10. What about Division? “Divide like an Egyptian”
      Division: (a / b / c / d) != (a / b) / (c / d)
      (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
      (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
      0.0134 != 0.857
      Not Composable. What were the Egyptians thinking?!
  • 11. What about Average?
      Overall AVG over (value, count) pairs: (3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
      Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
      Divide, Add, Divide? Not Composable
      Add, Add, Add? Composable!
      Single-Node Divide at the End? Doesn’t need to be Composable!
      AVG (3, 5, 5, 7) == 5
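The average slide above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the helper names `to_pair`, `combine`, and `finish` are made up here. Each value is lifted into a (sum, count) pair, pairs are merged with addition only, and the single division happens once at the end.

```python
from functools import reduce

def to_pair(value):
    """Lift a value into a (sum, count) pair."""
    return (value, 1)

def combine(left, right):
    """Associative merge: add the sums and the counts independently."""
    return (left[0] + right[0], left[1] + right[1])

def finish(pair):
    """The one division, done once after all partial results are merged."""
    total, count = pair
    return total / count

values = [3, 5, 5, 7]

# Sequential fold, as on a single node.
sequential = reduce(combine, map(to_pair, values))

# Pairwise grouping, as if two workers each combined half of the data.
worker_1 = combine(to_pair(3), to_pair(5))
worker_2 = combine(to_pair(5), to_pair(7))
pairwise = combine(worker_1, worker_2)

print(finish(sequential), finish(pairwise))  # 5.0 5.0
```

Because `combine` only adds, the grouping of partial results does not matter, which is the property the slide calls composable.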
  • 12. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 13. Similarities
  • 14. Euclidean Similarity: exists in Euclidean, flat space; based on Euclidean distance; linear measure; bias towards magnitude
  • 15. Cosine Similarity: angular measure; adjusts for Euclidean magnitude bias; normalize to unit vectors in all dimensions; org.jblas.DoubleMatrix
  • 16. Jaccard Similarity: set similarity measurement; set intersection / set union; based on Jaccard distance; bias towards popularity
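The three measures on the slides above can be compared with a small, self-contained sketch. The vectors and tag sets are hypothetical examples, and the function names are chosen here for illustration; the talk itself points at org.jblas.DoubleMatrix for the vector math.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(s & t) / len(s | t)

# Same direction, very different magnitude (hypothetical rating vectors).
light_user = [1.0, 2.0]
heavy_user = [10.0, 20.0]
print(euclidean_distance(light_user, heavy_user))  # large: magnitude bias
print(cosine_similarity(light_user, heavy_user))   # approximately 1.0

# Hypothetical tag sets for two movies.
tags_a = {"action", "pilot", "navy"}
tags_b = {"action", "navy", "courtroom"}
print(jaccard_similarity(tags_a, tags_b))  # 0.5
```

The first pair shows the magnitude bias of Euclidean distance: the two users point in the same direction, so cosine similarity is close to 1.0 even though the distance between them is large.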
  • 17. Log Likelihood Similarity: adjusts for popularity bias; Netflix “Shawshank” problem
  • 18. Word Similarity. Edit Distance: misspellings and autocorrect. Word2Vec: similar words are defined by similar contexts in vector space (figure: English, Spanish)
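As a sketch of the edit-distance idea on the slide above, here is the classic Levenshtein dynamic program (insertions, deletions, and substitutions each cost 1). The talk does not say which edit-distance variant it uses, so treat this choice as an assumption.

```python
def edit_distance(source, target):
    """Levenshtein distance using a rolling row of the DP table."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("spark", "sprak"))     # 2
```

A spell checker can rank candidate corrections by this distance; a transposition such as "sprak" counts as two substitutions under plain Levenshtein.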
  • 19. Demo! Find Synonyms with Word2Vec
  • 20. Find Synonyms using Word2Vec
  • 21. Document Similarity. TF/IDF: Term Freq / Inverse Document Freq; used by most search engines. Doc2Vec: similar documents are determined by similar contexts
  • 22. Bonus! Text Rank Document Summary. Text Rank (aka Sentence Rank): surface summary sentences; TF/IDF + Similarity Graph + PageRank. Most similar sentence to all other sentences: TF/IDF + Similarity Graph. Most influential sentences: PageRank
  • 23. Similarity Pathways (Recommendations): best recommendations for 2 (or more) people. “You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.” Item-to-Item Similarity Graph + Dijkstra Shortest Path
  • 24. Demo! Similarity Pathway for Movie Recommendations
  • 25. Load Movies with Tags into DataFrame (My Choice, Their Choice)
  • 26. Calculate Tag-based Movie Similarity: Jaccard Similarity (based on tag sets); keep pairs above the Jaccard Similarity threshold
  • 27. Create Movie-Tag Similarity Graph: edge value represents Jaccard Similarity (based on tag sets)
  • 28. Calculate Dijkstra Shortest Pathway
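The similarity-pathway steps above (similarity graph, then Dijkstra) can be sketched as follows. The graph, the intermediate titles "Movie A" and "Movie B", and the edge cost of 1 - similarity are all assumptions made for this illustration; the slides do not show how the demo converts similarity into path cost.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over {node: {neighbor: jaccard_similarity}}.

    Assumption: edge cost = 1 - similarity, so more similar movies are closer.
    Returns (total_cost, path) or None if the goal is unreachable.
    """
    queue = [(0.0, start, [start])]
    best = {start: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, similarity in graph.get(node, {}).items():
            new_cost = cost + (1.0 - similarity)
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical tag-based similarities between four movies.
graph = {
    "Mad Max": {"Movie A": 0.5, "Movie B": 0.25},
    "Movie A": {"Mad Max": 0.5, "Message in a Bottle": 0.75},
    "Movie B": {"Mad Max": 0.25, "Message in a Bottle": 0.5},
    "Message in a Bottle": {"Movie A": 0.75, "Movie B": 0.5},
}

cost, path = shortest_path(graph, "Mad Max", "Message in a Bottle")
print(cost, path)  # 0.75 ['Mad Max', 'Movie A', 'Message in a Bottle']
```

The movies in the middle of the returned path are the candidates for "our choice": they sit between the two starting preferences in the similarity graph.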
  • 29. Movies with Tags (My Choice, Their Choice, Our Choice)
  • 30. Calculating Similarity
      Exact Brute-Force Similarity: Cartesian product; O(n^2) shuffle and comparison; aka all-pairs, pair-wise, similarity join
      Approximate Similarity: sampling; bucketing or clustering; ignore joins of low-similarity probability
      Goal: reduce shuffle
  • 31. Similarity Graph: vertex is movie, tag, actor, plot summary, etc.; edges are relationships and weights (if provided)
  • 32. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 33. Recommendations
  • 34. Basic Terminology
      User: user seeking recommendations
      Item: item being recommended
      Explicit User Feedback: like, rating, movie view, profile read, search
      Implicit User Feedback: click, hover, scroll, navigation
      Instances: rows of user feedback/input data
      Features: columns of instance rows (of feedback/input data)
      Overfitting: training a model too closely to the training data & hyperparameters
      Hold Out Split: holding out some of the instances to avoid overfitting
      Cold Start Problem: not enough data to personalize (new)
      Hyperparameter: model-specific config knobs for tuning (tree depth, iterations)
      Model Evaluation: compare predictions to actual values of hold out split
      Feature Engineering: modify, reduce, combine features
  • 35. Features
      Binary: True or False
      Numeric Discrete: integers
      Numeric: real values
      Binning: convert continuous into discrete (Time of Day -> Morning, Afternoon)
      Categorical Ordinal: Size (Small -> Medium -> Large), Ratings (1 -> 5)
      Categorical Nominal: independent; Favorite Sports Teams, Dating Spots
      Temporal: time-based; Time of Day, Binge Viewing
      Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
      Media: Images, Audio, Video
      Geographic: (Longitude, Latitude), Geohash
      Latent: hidden features within data (Collaborative Filtering)
      Derived: Age of Movie, Duration of User Subscription
  • 36. Feature Engineering
      Dimension Reduction: reduce number of features in feature space
      Principal Component Analysis (PCA): find principal features that best describe data variance; peel dimensional layers back
      One-Hot Encoding: convert nominal categorical feature values into 0’s and 1’s; remove any numerical relationship between categories; convert each item to a binary vector with a single 1.0 column
        Bears    -> 1  -->  Bears    -> [1.0, 0.0, 0.0]
        49’ers   -> 2  -->  49’ers   -> [0.0, 1.0, 0.0]
        Steelers -> 3  -->  Steelers -> [0.0, 0.0, 1.0]
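The one-hot table above can be reproduced with a short sketch. The factory function `one_hot_encoder` is a name invented here; Spark ML has its own encoder classes, which this does not attempt to mirror.

```python
def one_hot_encoder(categories):
    """Build an encoder that maps each distinct category to a unit vector."""
    distinct = list(dict.fromkeys(categories))  # keep first-seen order
    index = {category: position for position, category in enumerate(distinct)}

    def encode(value):
        vector = [0.0] * len(index)
        vector[index[value]] = 1.0  # raises KeyError for unseen categories
        return vector

    return encode

encode = one_hot_encoder(["Bears", "49'ers", "Steelers"])
print(encode("Bears"))     # [1.0, 0.0, 0.0]
print(encode("49'ers"))    # [0.0, 1.0, 0.0]
print(encode("Steelers"))  # [0.0, 0.0, 1.0]
```

Unlike the integer codes 1, 2, 3, these vectors are all the same distance apart, so a model cannot read an ordering into the team names.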
  • 37. Feature Normalization & Standardization
      Goal: scale features to standard size; required by many ML algos
      Normalize Features: calculate the L1 (or L2, etc.) norm, then divide each element by it; org.apache.spark.ml.feature.Normalizer
      Standardize Features: apply standard normal transformation (mean == 0, stddev == 1); org.apache.spark.ml.feature.StandardScaler
      http://www.mathsisfun.com/data/standard-normal-distribution.htm
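A plain-Python sketch of the two transformations on the slide above. It only illustrates the arithmetic and is not the Spark implementation; in particular, the population standard deviation used here is an assumption, and a library scaler may use the sample standard deviation instead.

```python
import math

def l2_normalize(vector):
    """Divide each element by the vector's L2 norm (vector must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def standardize(column):
    """Shift to mean 0 and scale to stddev 1 (population stddev, an assumption)."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    stddev = math.sqrt(variance)
    return [(x - mean) / stddev for x in column]

print(l2_normalize([3.0, 4.0]))          # [0.6, 0.8]
print(standardize([3.0, 5.0, 5.0, 7.0])) # symmetric around 0.0
```

Normalization works row by row (one feature vector at a time), while standardization works column by column (one feature across all instances).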
  • 38. Non-Personalized Recommendations
  • 39. Cold Start Problem: new user, don’t know their preference, must show something! Movies with highest-rated actors: Top K aggregations. Facebook social graph: friend-based recommendations. Most desirable singles: PageRank of likes and dislikes
  • 40. Demo! GraphFrame PageRank
  • 41. Dating Site Example: Like Graph
  • 42. PageRank of Top Influencers
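The demo slides above show only screenshots, so the following is a rough power-iteration sketch of PageRank over a "like" graph, not the GraphFrames call used in the talk. The user names, the damping factor of 0.85, and the handling of users with no outgoing likes are all assumptions.

```python
def pagerank(likes, damping=0.85, iterations=50):
    """Power-iteration PageRank over {user: [users they like]}."""
    nodes = set(likes) | {t for targets in likes.values() for t in targets}
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by users with no outgoing likes is spread evenly.
        dangling = sum(rank[n] for n in nodes if not likes.get(n))
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for source, targets in likes.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical like graph: alice and bob both like carol; carol likes alice.
likes = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = pagerank(likes)
print(max(ranks, key=ranks.get))  # carol
```

Because the ranking depends only on the graph and not on who is asking, it can serve as a non-personalized fallback for the cold-start case.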
  • 43. Personalized Recommendations
  • 44. User-to-User Clustering: User Similarity
      Time-based: pattern of viewing (binge or casual); time of viewing (am or pm)
      Ratings-based: content ratings or number of views; average rating relative to others (critical or lenient)
      Search-based: search terms
  • 45. Item-to-Item Clustering: Item Similarity
      Profile text (TF/IDF, Word2Vec, NLP)
      Categories, tags, interests (Jaccard Similarity, LSH)
      Images, facial structures (Neural Nets, Eigenfaces)
      Dating Site Example: Items == Users! (My OKCupid Profile, My Hinge Profile)
      http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  • 46. Bonus: NLP Conversation Starter Bot. “If your responses to my generic opening lines are positive, I may read your profile.” Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment. http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  • 47. Bonus: Demo! Spark + Stanford CoreNLP Sentiment Analysis
  • 48. Bonus: Top 100 Country Song Sentiment
  • 49. Bonus: Surprising Results…?!
  • 50. Item-to-Item Based Recommendations: based on metadata (Genre, Description, Cast, City)
  • 51. Demo! Item-to-Item-based Recommendations: One-Hot Encoding + K-Means Clustering
  • 52. Convert Movie Tags to Feature Vectors
  • 53. Cluster Using Movie-Tag Feature Vectors
  • 54. Analyze Movie Tag Clusters
  • 55. User-to-Item Collaborative Filtering: Matrix Factorization
      ① Factor the large matrix (left) into 2 smaller matrices (right)
      ② Lower-rank matrices approximate original when multiplied
      ③ Fill in the missing values of the large matrix
      ④ Surface k (rank) latent features from user-item interactions
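Steps ② and ③ above can be illustrated without fitting anything: given two small factor matrices, a missing cell of the large user-item matrix is just a dot product. The rank-2 factors below are hand-picked, hypothetical values rather than output of ALS, and `predict` and `recommend` are names invented for this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical rank-2 latent factors (hand-picked, not learned by ALS).
user_factors = {
    "user_1": [1.0, 0.0],
    "user_2": [0.0, 1.0],
    "user_3": [1.0, 0.5],
}
item_factors = {
    "movie_a": [5.0, 1.0],
    "movie_b": [2.0, 4.0],
}

def predict(user, item):
    """Approximate one cell of the user-item matrix from the two factors."""
    return dot(user_factors[user], item_factors[item])

def recommend(user, seen, top_n=1):
    """Rank items the user has not interacted with by predicted score."""
    scores = {item: predict(user, item)
              for item in item_factors if item not in seen}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(predict("user_3", "movie_a"))            # 5.5
print(recommend("user_2", seen=set()))         # ['movie_b']
print(recommend("user_1", seen={"movie_a"}))   # ['movie_b']
```

Each position in a factor vector is one latent feature (step ④); a fitting procedure such as ALS chooses those values so the products match the observed ratings.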
  • 56. Item-to-Item Collaborative Filtering: famous Amazon paper circa 2003
      Problem: as users grew, user-to-item collaborative filtering didn’t scale
      Solution: item-to-item similarity, nearest neighbors
      Offline (Batch): generate itemId->List[userId] vectors
      Online (Real-time): from cart, recommend nearest neighbors in vector space
  • 57. Demo! Collaborative Filtering-based Recommendations
  • 58. Fitting the Matrix Factorization Model
  • 59. Show ItemFactors Matrix from ALS
  • 60. Show UserFactors Matrix from ALS
  • 61. Generating Individual Recommendations
  • 62. Generating Batch Recommendations
  • 63. Clustering + Collaborative Filtering Recs: cluster matrix output from Matrix Factorization; latent features derived from user-to-item interactions. Item-to-Item Similarity: cluster item-factor matrix. User-to-User Similarity: cluster user-factor matrix
  • 64. Demo! Clustering + Collaborative Filtering-based Recommendations
  • 65. Show ItemFactors Matrix from ALS
  • 66. Convert Item Factors -> mllib.Vector (required by K-Means Clustering algorithm)
  • 67. Fit and Evaluate K-Means Cluster Model (K = 5 clusters; evaluation measures closeness of points within clusters)
  • 68. Netflix Genres and Clusters
      Typical Genres: Documentary, Romance, Comedy, Horror, Action, Adventure
      Latent (Hidden) Clusters: Emotionally-Independent Dramas for Hopeless Romantics; Witty Dysfunctional-Family TV Animated Comedies; Romantic Crime Movies based on Classic Literature; Latin American Forbidden-Love Movies; Critically-acclaimed Emotional Drug Movie; Cerebral Military Movie based on Real Life; Sentimental Movies about Horses for Ages 11-12; Gory Canadian Revenge Movies; Raunchy Mad Scientist Comedy
  • 69. Demo! Personalized PageRank
  • 70. Personalized PageRank
  • 71. Personalized PageRank (No Outbound)
  • 72. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 73. When to Approximate?
      Memory or time constrained queries: relative vs. exact counts are OK (# errors between then and now)
      Using machine learning or graph algos: inherently probabilistic and approximate; finding topics in documents (LDA); finding similar pairs of users, items, words at scale (LSH); finding top influencers (PageRank)
      Streaming aggregations: inherently sloppy collection (exactly once?)
      Approximate as much as you can get away with! Ask for forgiveness later!!
  • 74. When NOT to Approximate? If you’ve ever heard the term “Sarbanes-Oxley” at the office.
  • 75. A Few Good Algorithms: You can’t handle the approximate!
  • 76. Common to These Algos & Data Structs
      Low, fixed size in memory
      Known error bounds
      Store large amount of data in less memory than Java/Scala collections
      Tunable tradeoff between size and error
      Rely on multiple hash functions or operations; size of hash range defines error
  • 77. Bloom Filter: Set.contains(key): Boolean. “Hash Multiple Times and Flip the Bits Wherever You Land”
  • 78. Bloom Filter: approximate Set.contains(key); No means No, Yes means Maybe; elements can only be added, never updated or removed
  • 79. Bloom Filter in Action: set(key); contains(key): Boolean. Set.contains(key): TRUE -> maybe contains. Set.contains(key): FALSE -> definitely does not contain. Images by @avibryant
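A minimal Bloom filter sketch matching the slides above: hash the key several times and flip the bit at each position. Deriving the hash functions from SHA-256 with a seed prefix, the default sizes, and the method name `might_contain` are choices made for this example only.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One hash function per seed, simulated by prefixing the seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means definitely absent; True means maybe present.
        return all(self.bits[position] for position in self._positions(key))

viewed = BloomFilter()
viewed.add("Top Gun")
print(viewed.might_contain("Top Gun"))  # True (maybe contains)
```

Bits are only ever switched on, which is why elements can be added but never removed: clearing a bit could erase evidence of some other key.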
  • 80. CountMin Sketch: Frequency Count and TopK. “Hash Multiple Times and Add 1 Wherever You Land”
  • 81. CountMin Sketch (CMS): approximate frequency count and TopK for key, e.g. “Heavy Hitters” on Twitter (Matei Zaharia, Martin Odersky, Donald Trump)
  • 82. CountMin Sketch In Action (TopK, Count)
      Use case: TopK movies using total views
      add(Top Gun, 2); add(A Few Good Men, 1); add(Taps, 1); getCount(Top Gun): Long
      Multiple hash functions (1 hash function per row)
      Binary hash output (1 element per column)
      Find minimum of all rows
      Can overestimate (overlapping cells), but never underestimate
      x 2 occurrences of “Top Gun” for slightly additional complexity
      Images derived from @avibryant
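The add and getCount operations from the slide above can be sketched as a small table of counters with one hash function per row. The table dimensions and the SHA-256-based hashing are assumptions for illustration; `get_count` mirrors the slide's getCount.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts: may overestimate, never underestimates."""

    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function each
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, column in self._cells(key):
            self.table[row][column] += count

    def get_count(self, key):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.table[row][column] for row, column in self._cells(key))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))  # at least 2
```

Taking the minimum across rows works because a collision in one row is unlikely to repeat in every other row.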
  • 83. HyperLogLog: Count Distinct. “Hash Multiple Times and Uniformly Distribute Where You Land”
  • 84. HyperLogLog (HLL): approximate count distinct
      Slight twist: special hash function creates uniform distribution
      Error estimate: 14 bits for size of range; m = 2^14 = 16,384 hash slots; error = 1.04 / sqrt(16,384) = 0.81%
      (figure annotation: “Not many of these”)
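The error estimate on the slide above is easy to check numerically. The sketch below also computes the size of a dense register array; the 6 bits per register used there is an assumption taken from common HyperLogLog implementations, not from the slide.

```python
import math

def hll_relative_error(precision_bits):
    """Standard error of a HyperLogLog estimate with 2**precision_bits slots."""
    slots = 2 ** precision_bits
    return 1.04 / math.sqrt(slots)

def hll_dense_bytes(precision_bits, bits_per_register=6):
    """Memory for a dense register array (6 bits per register is an assumption)."""
    return (2 ** precision_bits) * bits_per_register // 8

print(f"{hll_relative_error(14):.2%}")  # 0.81%
print(hll_dense_bytes(14))              # 12288
```

Each extra precision bit doubles the number of slots and shrinks the error by a factor of sqrt(2), which is the size-versus-error tradeoff these structures share.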
  • 85. HyperLogLog In Action (Count Distinct). Use case: number of distinct users who view a movie. Uniform distribution: estimate distinct # of users by inspecting just the beginning. Combine across different scales. (figure: panels “Top Gun: Hour 1”, “Top Gun: Hour 2”, and “Top Gun: Hour 1 + 2”, with user IDs placed on 0 to 16 and 0 to 32 hash ranges)
  • 86. Locality Sensitive Hashing: Set Similarity. “Pre-process Items into Buckets, Compare Within Buckets”
  • 87. Locality Sensitive Hashing (LSH): approximate set similarity
      Hash designed to cluster similar items
      Avoids cartesian all-pairs comparison
      Pre-process m rows into b buckets, b << m
      Hash items multiple times; similar items hash to overlapping buckets
      Compare just contents of buckets: much smaller cartesian … and parallel !!
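One common way to realize the bucketing described above is MinHash signatures split into bands. The slide does not name a specific LSH family, so MinHash with banding, the band and row counts, and the movie tag sets are all assumptions in this sketch.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes):
    """For each seeded hash function, keep the minimum hash over a non-empty set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def candidate_pairs(tag_sets, bands=4, rows=2):
    """Bucket each band of the signature; only bucket-mates become candidates."""
    buckets = defaultdict(set)
    for name, tags in tag_sets.items():
        signature = minhash_signature(tags, bands * rows)
        for band in range(bands):
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

# Hypothetical tag sets: two identical, one unrelated.
tag_sets = {
    "Movie A": {"action", "pilot", "navy"},
    "Movie B": {"action", "pilot", "navy"},
    "Movie C": {"romance", "letters", "beach"},
}
print(candidate_pairs(tag_sets))  # {('Movie A', 'Movie B')}
```

Only the candidate pairs need an exact Jaccard comparison afterwards, which replaces the all-pairs cartesian product with a much smaller one per bucket.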
  • 88. Power of data. Simplicity of design. Speed of innovation.IBM Spark spark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark spark.tc DIMSUM Set Similarity “Pre-process and ignore data that is unlikely to be similar.” 88
  • 89. Power of data. Simplicity of design. Speed of innovation.IBM Spark spark.tcspark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark DIMSUM “Dimension Independent Matrix Square Using MR” Remove vectors with low probability of similarity RowMatrix.columnSimiliarites(threshold) Twitter DIMSUM Case Study 40% efficiency gain over bruce-force Cosine Sim 89
  • 90. Power of data. Simplicity of design. Speed of innovation.IBM Spark spark.tcspark.tcPower of data. Simplicity of design. Speed of innovation.IBM Spark Common Tools to Approximate Twitter Algebird Redis Apache Spark 90 Composable Library Distributed Cache Big Data Processing
  • 91. Twitter Algebird: rooted in algebraic fundamentals! Parallel, associative, composable. Examples: Min, Max, Avg; BloomFilter (Set.contains(key)); HyperLogLog (count distinct); CountMin Sketch (top-K count).
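Why "associative" matters for "parallel" can be shown with the Avg example. The sketch below is plain Python, not Algebird's API: the function names are made up for illustration. It represents an average as a (count, sum) pair whose combine step is associative, so partitions can be reduced independently and merged in any grouping.

```python
from functools import reduce

AVG_ZERO = (0, 0)   # identity element: zero items, zero sum


def avg_of(x):
    """Lift a single value into the (count, sum) representation."""
    return (1, x)


def avg_plus(a, b):
    """Associative combine: add counts and add sums."""
    return (a[0] + b[0], a[1] + b[1])


def avg_value(a):
    """Extract the final average from the (count, sum) pair."""
    return a[1] / a[0]


# Hypothetical ratings spread across three partitions.
partitions = [[4, 5, 3], [5], [2, 4, 5, 4]]

# Each partition is reduced on its own (this is the parallel step)...
partials = [reduce(avg_plus, (avg_of(x) for x in part), AVG_ZERO) for part in partitions]

# ...and the partial results are merged afterwards.
total = reduce(avg_plus, partials, AVG_ZERO)
print(total, avg_value(total))   # (8, 32) 4.0
```

Averaging the per-partition averages directly would give the wrong answer; carrying (count, sum) is what makes the operation composable.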
  • 92. Redis Implementation of HyperLogLog (Count Distinct): 12 KB per counter; 2^64 max number of items; 0.81% error (tunable). Add user views for a given movie (duplicates are ignored): PFADD TopGun_HLL user1001 user2009 user3005, then PFADD TopGun_HLL user3003 user1001. Get the distinct count (cardinality) of the set: PFCOUNT TopGun_HLL returns 4 (distinct users who viewed this movie). Union two HyperLogLog data structures: PFMERGE TopGun_HLL Taps_HLL.
  • 93. Spark Approximations. Spark Core: RDD.count*Approx(), which returns a PartialResult. Spark SQL: approxCountDistinct(column), based on HyperLogLogPlus. Spark ML: stratified sampling with PairRDD.sampleByKey(fractions: Map[K, Double]); DIMSUM sampling with RowMatrix.columnSimilarities(threshold), where probabilistic sampling reduces the amount of shuffle.
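Stratified sampling by key can be illustrated without a cluster. The following is a plain-Python imitation of the idea behind sampleByKey (keep each record with a probability that depends on its key); it is not Spark's implementation, and the function name, the seed default, and the choice to treat missing keys as fraction 0.0 are assumptions of this sketch.

```python
import random


def sample_by_key(pairs, fractions, seed=42):
    """Keep each (key, value) pair with probability fractions[key]."""
    rng = random.Random(seed)   # seeded so the sample is reproducible
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


# Hypothetical (genre, userId) view events.
views = [("drama", u) for u in range(1000)] + [("comedy", u) for u in range(1000)]

# Sample roughly 10% of drama views and roughly 50% of comedy views.
sample = sample_by_key(views, {"drama": 0.1, "comedy": 0.5})
print(len(sample), "of", len(views), "events kept")
```

Per-key fractions let a rare stratum be sampled heavily while a common one is sampled lightly, which a single global fraction cannot do.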
  • 94. Demo! Exact Count vs. Approximate HLL and CMS Count
  • 95. HashSet vs. HyperLogLog (Memory)
  • 96. HashSet vs. CountMin Sketch (Memory)
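The exact-versus-approximate trade-off from these demo slides can be reproduced in miniature with a toy Count-Min Sketch next to an exact counter. This is not the demo code from the talk: the class name, the SHA-1 based row hashes, and the width and depth defaults are assumptions for illustration.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Toy Count-Min Sketch: fixed memory of depth * width counters."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))


# Hypothetical stream of movie views.
stream = ["Top Gun"] * 500 + ["Taps"] * 120 + [f"movie{i}" for i in range(2000)]

exact = Counter(stream)          # memory grows with the number of distinct movies
sketch = CountMinSketch()        # memory stays at depth * width counters
for movie in stream:
    sketch.add(movie)

print(exact["Top Gun"], sketch.estimate("Top Gun"))
```

The sketch may overestimate a count when items collide, but it never underestimates, which is why it suits top-K style queries over heavy hitters.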
  • 97. Demo! Exact Similarity vs. Approximate LSH Similarity
  • 98. Brute Force Cartesian All Pair Similarity: 47 seconds
  • 99. Locality Sensitive Hash All Pair Similarity: 6 seconds
  • 100. Many More Demos! Download Docker or clone on GitHub: http://advancedspark.com
  • 101. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  • 102. Netflix Recommendations: From Ratings to Real-time
  • 103. Netflix Has a Lot of Data. Netflix has a lot of data about a lot of users and a lot of movies. Netflix can use this data to buy new movies. Netflix can use this data to choose original programming. Netflix knows that a lot of people like politics and Kevin Spacey. Netflix is global: the UK doesn’t have White Castle, so my favorite movie, “Harold and Kumar Go to White Castle”, was renamed “Harold and Kumar Get the Munchies”. This broke my unit tests! Summary: Buy NFLX stock!
  • 104. Netflix Data Pipeline - Then: v1.0, v2.0
  • 105. Netflix Data Pipeline - Now (Keystone), v3.0: 9 million events per second, 22 GB per second! EC2 D2XL (disk: 6 TB, 475 MB/s; RAM: 30 GB; network: 700 Mbps). Auto-scaling, fault tolerance. A/B tests, Trending Now. Samza. Splits high and normal priority.
  • 106. Netflix Recommendation Data Pipeline: throw away batch-generated user factors (U); keep video factors (V).
  • 107. Netflix Trending Now (Time-based Recs): uses Spark Streaming; personalized to the user (viewing history, past ratings); learns and adapts to events (Valentine’s Day). “VHS”. Calculate take rate: number of plays / number of impressions.
  • 108. Bonus: Pandora Time-based Recs. Work days: play familiar music; the user is less likely to accept new music. Evenings and weekends: play new music; the user is more likely to accept new music.
  • 109. $1 Million Netflix Prize (2006-2009). Goal: improve movie predictions by 10% (root mean squared error, RMSE); test data withheld to calculate RMSE upon submission. Dataset of 5-star ratings: (userId, movieId, rating, timestamp). Winning algorithm(s): 10.06% improvement (RMSE); ensemble of 500+ ML models combined with GBDTs; computationally impractical.
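The prize metric is easy to state in code. The sketch below computes RMSE and the relative improvement of one model over a baseline; the function names and the sample ratings are made up for illustration and are not Netflix Prize data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement(baseline_rmse, new_rmse):
    """Relative RMSE reduction, e.g. 0.10 means a 10% improvement."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical held-out ratings and two sets of predictions.
actual = [4, 3, 5, 2, 4]
baseline_predictions = [3.0, 3.0, 4.0, 3.0, 3.0]
new_predictions = [3.5, 3.0, 4.5, 2.5, 3.5]

base = rmse(baseline_predictions, actual)
new = rmse(new_predictions, actual)
print(f"baseline={base:.4f} new={new:.4f} improvement={improvement(base, new):.1%}")
```

Because errors are squared before averaging, a few badly wrong predictions hurt RMSE more than many slightly wrong ones.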
  • 110. Secrets to the Winning Algorithms. Adjust for the following human biases: ① Alice effect: user rates lower than average ② Inception effect: movie rated higher than average ③ Overall mean rating of a movie ④ Number of people who have rated a movie ⑤ Number of days since user’s first rating ⑥ Number of days since movie’s first rating ⑦ Mood, time of day, day of week, season, weather
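The first two effects are often modeled as a baseline predictor: global mean plus a per-movie offset plus a per-user offset. The sketch below is a simplified, unregularized version of that idea for illustration; it is not claimed to be the winning teams' exact method, and the function names and sample ratings are assumptions.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns (mu, user_bias, movie_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating

    # Movie bias: how far a movie's ratings sit above or below the global mean.
    by_movie = defaultdict(list)
    for _, movie, r in ratings:
        by_movie[movie].append(r - mu)
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}

    # User bias: what is left after removing the mean and the movie bias.
    by_user = defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - mu - movie_bias[movie])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}

    return mu, user_bias, movie_bias


def predict(mu, user_bias, movie_bias, user, movie):
    """Baseline prediction; unseen users or movies fall back to a bias of 0."""
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)


# Hypothetical ratings: Alice rates low, Inception is rated high.
ratings = [
    ("alice", "Inception", 4), ("bob", "Inception", 5),
    ("alice", "Titanic", 2), ("bob", "Titanic", 3),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
print(mu, user_bias["alice"], movie_bias["Inception"])   # 3.5 -0.5 1.0
```

Removing these offsets first lets a later model, such as matrix factorization, focus on the user-movie interaction rather than on who rates harshly or which titles are broadly liked.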
  • 111. Netflix Common ML Algorithms: Logistic Regression, Linear Regression, Gradient Boosted Decision Trees, Random Forest, Matrix Factorization, SVD, Restricted Boltzmann Machines, Deep Neural Nets, Markov Models, LDA, Clustering. Ensembles!
  • 112. Netflix Genres and Clusters. Typical genres: Documentaries, Romance Comedies, Horror, Action, Adventure. Latent (hidden) clusters: Emotionally-Independent Dramas for Hopeless Romantics; Witty Dysfunctional-Family TV Animated Comedies; Romantic Crime Movies based on Classic Literature; Latin American Forbidden-Love Movies; Critically-acclaimed Emotional Drug Movie; Cerebral Military Movie based on Real Life; Sentimental Movies about Horses for Ages 11-12; Gory Canadian Revenge Movies; Raunchy Mad Scientist Comedy.
  • 113. Netflix Social Integration: post to Facebook after movie start (5 mins); recommend to new users based on friends; helps with the cold-start problem.
  • 114. Netflix Search. No results? No problem: show similar results! Utilize the extensive DVD catalog; metadata search (ElasticSearch); named entity recognition (NLP). Empty searches are an opportunity: explicit feedback for future recommendations, and content to buy and produce!
  • 115. Higher Ratings in 2004? In 2004, Netflix noticed higher ratings on average. Some possible reasons why: ① Significant UI improvements deployed ② New recommendation engine deployed ③
  • 116. Thank You, Everyone!! Chris Fregly, @cfregly, IBM Spark Tech Center, San Francisco, California, USA. http://advancedspark.com. Sign up for the Meetup and Book; contribute to the GitHub repo; run all demos using Docker. Find me: LinkedIn, Twitter, GitHub, Email, Fax. Image derived from http://www.duchess-france.org/
  • 117. Power of data. Simplicity of design. Speed of innovation. IBM Spark http://advancedspark.com @cfregly