Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark & Recommendations
Spark, Streaming, Machine Learning, Graph Processing,
Approximations, Probabilistic Data Structures, NLP 
Apache Spark Maryland Meetup
Thanks to Tetra Concepts & Jailbreak Brewing Co!!
Feb 22nd, 2016
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring! (Only Nice People)!
Who Am I?

Streaming Data Engineer
Netflix OSS Committer

Data Solutions Engineer

Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced .
Due 2016
Recent World Tour: Freg-a-Palooza!
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Istanbul Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 28th)
Singapore Spark Meetup (Dec 1st)
Sydney Spark Meetup (Dec 8th)
Melbourne Spark Meetup (Dec 9th)
Toronto Spark Meetup (Dec 14th)
Advanced Apache Spark Meetup
Meetup Metrics
Top 5 Most-active Spark Meetup!
2600 Members in just 6 mos!!
2600 Docker downloads (demos)
Meetup Mission
Deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance 

Live, Interactive Demo!!
Audience Participation Required
(cell phone or laptop)
[Diagram: demo architecture. Labels: End User, ElasticSearch, Spark ML, Data Scientist, Kafka, Spark, Cassandra, Zeppelin]
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
  Netflix Recommendations and Data Pipeline
Scaling with Parallelism
O(log n)
Scaling with Composability

Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiplication: (a * b * c * d) == (a * b) * (c * d)
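A minimal sketch of why composability helps with scaling, assuming a hypothetical helper `tree_reduce` (not a Spark or Algebird API). Because each operation on this slide is associative, values can be combined pairwise in any grouping, so every level of the tree could run in parallel and only about log2(n) levels are needed.

```python
from functools import reduce
import operator

def tree_reduce(op, values):
    """Combine a non-empty list pairwise, level by level, like a parallel reduction tree."""
    level = list(values)
    while len(level) > 1:
        # Pairs on the same level are independent, so they could be combined in parallel.
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # an odd element carries over to the next level
        level = paired
    return level[0]

values = [3, 5, 5, 7]
sets = [{1}, {1, 2}, {3}, {2, 4}]

# Associative operations give the same answer sequentially or as a tree.
assert tree_reduce(max, values) == reduce(max, values)
assert tree_reduce(operator.add, values) == reduce(operator.add, values)
assert tree_reduce(operator.mul, values) == reduce(operator.mul, values)
assert tree_reduce(operator.or_, sets) == reduce(operator.or_, sets)
```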
What about Division?
(a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
Not Composable
What were the Egyptians thinking?!
“Divide like an Egyptian”
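A quick check of the numbers on this slide, using Python's `fractions` module so the arithmetic stays exact. It is only an illustration of why division cannot be regrouped across partitions.

```python
from fractions import Fraction

a, b, c, d = (Fraction(n) for n in (3, 4, 7, 8))

sequential = ((a / b) / c) / d   # left to right, as the expression is written
grouped = (a / b) / (c / d)      # the grouping a pairwise (parallel) reduce would use

print(sequential)  # 3/224
print(grouped)     # 6/7
```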
What about Average?
Overall AVG
[3, 1], [5, 1], [5, 1], [7, 1]
((3 + 5) + (5 + 7)) / ((1 + 2) + 1) == 20 / 4 == 5
Add, Add, Add? Single Divide at the End?
AVG (3, 5, 5, 7) == 5
Pairwise AVG
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide?
Doesn’t need to be Composable!
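One way to read this slide, sketched under the assumption that each value travels as a (sum, count) pair: the pairs combine associatively, and the single divide happens only at the end. The function names here are hypothetical.

```python
from functools import reduce

def to_pair(x):
    return (x, 1)                       # (running sum, running count)

def combine(p, q):
    return (p[0] + q[0], p[1] + q[1])   # associative: add sums, add counts

def finish(p):
    return p[0] / p[1]                  # single divide at the end

left = reduce(combine, map(to_pair, [3, 5]))    # one partition
right = reduce(combine, map(to_pair, [5, 7]))   # another partition
overall = finish(combine(left, right))          # 20 / 4

pairwise = (3 + 5) / 2 + (5 + 7) / 2            # the "divide, add, divide" attempt
```

Here `overall` matches AVG(3, 5, 5, 7), while `pairwise` reproduces the wrong answer from the slide.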
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
  Netflix Recommendations and Data Pipeline
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance 
Linear measure
Bias towards magnitude
Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalizes to unit vectors
Jaccard Similarity
Set similarity measurement
Set intersection / set union
Based on Jaccard distance
Bias towards popularity
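A small sketch of the Euclidean, cosine, and Jaccard measures from the last three slides, in plain Python with hypothetical function names. The example vectors point in the same direction but differ in magnitude, which shows the magnitude bias that cosine similarity removes.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def jaccard_similarity(a, b):
    return len(a & b) / len(a | b)      # set intersection / set union

u, v = [1.0, 2.0], [10.0, 20.0]         # same direction, different magnitude
far_apart = euclidean_distance(u, v)    # large, driven by magnitude
same_angle = cosine_similarity(u, v)    # approximately 1.0

tags_a = {"action", "navy", "pilot"}
tags_b = {"navy", "pilot", "drama"}
overlap = jaccard_similarity(tags_a, tags_b)  # 2 shared / 4 total
```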
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
Word Similarity
Based on edit distance
Calculate char differences between words
Deletes, transposes, replaces, inserts
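A sketch of an edit distance that counts the four operations listed above. It assumes the optimal-string-alignment variant (adjacent transposes count as one edit); the talk does not say which variant was used.

```python
def edit_distance(s, t):
    """Edits needed to turn s into t: inserts, deletes, replaces, adjacent transposes."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every char of s
    for j in range(cols):
        d[0][j] = j                      # insert every char of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + cost)   # replace (or match)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]

print(edit_distance("spark", "sprak"))     # 1 (one transpose)
print(edit_distance("kitten", "sitting"))  # 3
```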
Document Similarity

Term Freq / Inverse Document Freq

Used by most search engines


Words embedded in a vector space near similar words
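A minimal TF/IDF sketch, assuming one common weighting (term frequency times log of inverse document frequency). Search engines and Spark ML use their own variants, so treat the formula and the name `tf_idf` as illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted

docs = [
    "top gun fighter pilot".split(),
    "a few good men navy lawyer".split(),
    "top secret navy pilot".split(),
]
weights = tf_idf(docs)
# "gun" appears in one document, "top" in two, so "gun" is weighted higher.
```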

Similarity Pathway
i.e. Closest recommendations between 2 people
Calculating Similarity
Exact Brute-Force

“All-pairs similarity” 

aka “Pair-wise similarity”, “Similarity join”

Cartesian O(n^2) shuffle and comparison



Bucketing (aka “Partitioning”, “Clustering”)

Remove data with low probability of similarity

Reduce shuffle and comparisons
Bonus: Document Summary
Text Rank

aka “Sentence Rank”

TF/IDF + Similarity Graph + PageRank


Surface summary sentences (abstract)

Most similar to all others (TF/IDF + Similarity Graph)

Most influential sentences (PageRank)
Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights
Topic-Sensitive PageRank
Graph diffusion algorithm
Pre-process graph, add vector of probabilities to each vertex

Probability of landing at this vertex from every other vertex
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like, rating, movie view, profile read, search
Implicit User Feedback: click, hover, scroll, navigation
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting 
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new user or item)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)
Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription
Feature Engineering
Dimension Reduction

Reduce number of features in feature space

Principal Component Analysis (PCA)

Helps find principal features that best describe variance in data

Peel the dimensional layers back until you describe the data

One-Hot Encoding

Convert nominal categorical feature values to 0’s, 1’s

Remove numerical relationship between the categories

Label encoding: Bears -> 1, 49’ers -> 2, Steelers -> 3
One-hot encoding: Bears -> [1,0,0], 49’ers -> [0,1,0], Steelers -> [0,0,1]
1 binary column per category
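A short sketch of one-hot encoding for the sports-team example, assuming categories are numbered in first-seen order so that Steelers lands on [0,0,1] as on the slide. The helper name `one_hot` is hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per category."""
    categories = list(dict.fromkeys(values))   # unique values, first-seen order
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                      # exactly one column is set per row
        rows.append(row)
    return categories, rows

teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, encoded = one_hot(teams)
```

Unlike the 1, 2, 3 labels, the vectors carry no ordering, so a model cannot read Steelers as "three times" Bears.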
Normalize and Standardize Features

Scale features to standard size

Required by many ML algos
Normalize Features

Calculate L1 (or L2, etc) norm

Divide elements by norm
Standardize Features

Apply standard normal transformation

Mean == 0

StdDev == 1 
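A plain-Python sketch of the two scalings on this slide. The function names are hypothetical, and `standardize` assumes the population standard deviation; libraries differ on that choice.

```python
import math

def l1_normalize(v):
    norm = sum(abs(x) for x in v)             # L1 norm
    return [x / norm for x in v]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))   # L2 norm
    return [x / norm for x in v]

def standardize(column):
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column]  # mean 0, standard deviation 1

unit_l1 = l1_normalize([1.0, 3.0])
unit_l2 = l2_normalize([3.0, 4.0])
z_scores = standardize([3.0, 5.0, 5.0, 7.0])
```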

Non-Personalized Recommendations
Cold Start Problem
“Cold Start” problem

New user, don’t know their preference, must show something!

Movies with highest-rated actors

Top K Aggregations


Most desirable singles

PageRank of likes and dislikes

Facebook social graph

Friend-based recommendations
Personalized Recommendations
Clustering (aka. Nearest Neighbors)
User-to-User Clustering (User Behavior)

Similar items viewed or rated

Similar viewing pattern (i.e. binge or casual)
Item-to-Item Clustering (Item Description)

Similar item tags/metadata (Jaccard Similarity, Locality Sensitive Hash)

Similar profile text and categories (TF/IDF, Word2Vec, NLP)

Similar images/facial structures (Convolutional Neural Nets, Eigenfaces)

[Images: OKCupid profile; my Hinge profile]
Bonus: NLP Conversation Bot
“If your responses to my generic opening
lines are positive, I may read your profile.” 

Spark ML and Stanford CoreNLP:
TF/IDF, DecisionTrees, Sentiment
User-to-Item Collaborative Filtering
Matrix Factorization
①  Factor the large matrix (left) into 2 smaller matrices (right)
②  Smaller matrices, when multiplied, approximate original
③  Fill in the missing values within the large matrix
④  Surface latent features from within user-item interaction
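A toy sketch of steps ① to ④, assuming plain stochastic gradient descent with made-up ratings and hyperparameters. Spark ML's recommender uses alternating least squares instead, so this only illustrates the idea of two small factor matrices approximating the large one.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=3000, seed=0):
    """ratings: {(user, item): rating}. Returns user factors U and item factors V."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)   # nudge user factor
                V[i][f] += lr * (err * uf - reg * vf)   # nudge item factor
    return U, V

def predict(U, V, u, i):
    return sum(a * b for a, b in zip(U[u], V[i]))

# Hypothetical 5 users x 4 items; pairs that are absent are the missing values.
ratings = {(0, 0): 5, (0, 1): 3, (0, 3): 1, (1, 0): 4, (1, 3): 1,
           (2, 0): 1, (2, 1): 1, (2, 3): 5, (3, 0): 1, (3, 3): 4,
           (4, 1): 1, (4, 2): 5, (4, 3): 4}
U, V = factorize(ratings, n_users=5, n_items=4)
filled_in = predict(U, V, 0, 2)   # estimate for a rating user 0 never gave
```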
Item-to-Item Collaborative Filtering
Made famous by Amazon Paper ~2003


As # of users grew, user-item collab filtering didn’t scale



Generate itemId -> List[userId] vectors


For each item in cart, recommend similar items from vector space
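A small sketch of the itemId -> users idea above, assuming Jaccard similarity over the user sets as a stand-in for whatever measure a production system would use. The event data and function names are hypothetical.

```python
from collections import defaultdict

def item_user_sets(events):
    """events: iterable of (userId, itemId) pairs -> {itemId: set of userIds}."""
    item_users = defaultdict(set)
    for user_id, item_id in events:
        item_users[item_id].add(user_id)
    return item_users

def similar_items(item_id, item_users, top_n=2):
    target = item_users[item_id]
    scores = []
    for other, users in item_users.items():
        if other == item_id:
            continue
        score = len(target & users) / len(target | users)  # Jaccard on user sets
        scores.append((score, other))
    scores.sort(reverse=True)
    return [other for score, other in scores[:top_n] if score > 0]

events = [
    ("u1", "Top Gun"), ("u1", "Taps"),
    ("u2", "Top Gun"), ("u2", "Taps"),
    ("u3", "Top Gun"), ("u3", "A Few Good Men"),
    ("u4", "Frozen"),
]
recs = similar_items("Top Gun", item_user_sets(events))
```

The item vectors can be built offline, so the work at recommendation time depends on the number of items in the cart rather than the number of users.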

Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
  Netflix Recommendations and Data Pipeline
When to Approximate?
Memory or time constrained queries

Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos

Inherently probabilistic and approximate

Finding topics in documents (LDA)

Finding similar pairs of users, items, words at scale (LSH)

Finding top influencers (PageRank)
Streaming aggregations

Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later !!
When NOT to Approximate?
If you’ve ever heard the term…


…at the office after 2002.
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
  Netflix Recommendations and Data Pipeline
A Few Good Algorithms
You can’t handle 

the approximate!
Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error
Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
Bloom Filter
Approximate set membership for key

False positive: filter answers contains(), but the key was never added

True negative: filter answers !contains(), and the key was never added (no false negatives)

Elements are only added, never removed
Bloom Filter in Action
 contains(key): Boolean
Images by @avibryant
TRUE -> maybe contains
FALSE -> definitely does not contain.
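A minimal Bloom filter sketch of "hash multiple times and flip the bits wherever you land". The bit-array size, the number of hashes, and the salted SHA-256 hashing are assumptions for illustration; Algebird and other libraries choose these differently.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Simulate several hash functions by salting one hash with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True            # flip the bit wherever we land

    def contains(self, key):
        # True -> maybe contains; False -> definitely does not contain.
        return all(self.bits[pos] for pos in self._positions(key))

seen = BloomFilter()
seen.add("Top Gun")
maybe = seen.contains("Top Gun")             # True: added keys are never missed
```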
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
CountMin Sketch (CMS)
Approximate frequency count and TopK for key 
i.e. “Heavy Hitters” on Twitter
Matei Zaharia
 Martin Odersky
 Donald Trump
CountMin Sketch In Action (TopK, Count)
Images derived from @avibryant
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
2 occurrences of “Top Gun” for slight additional complexity
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch grid with cells for Top Gun (x 2) and A Few Good Men; some cells overlap Top Gun, others overlap A Few Good Men]
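A minimal Count-Min Sketch for the add/getCount calls on this slide. The depth, width, and salted SHA-256 hashing are assumptions for illustration, and `get_count` is a hypothetical Python spelling of getCount.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth                   # rows: one hash function per row
        self.width = width                   # columns: counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only ever inflate a counter, so the minimum across rows
        # can overestimate but never underestimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_views = views.get_count("Top Gun")   # at least 2
```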
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
HyperLogLog (HLL)
Approximate count distinct

Slight twist

Special hash function creates uniform distribution

Error estimate

14 bits for size of range

m = 2^14 = 16,384 hash slots

error = 1.04/(sqrt(16,384)) = .81% 
Not many of these
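The error estimate above can be reproduced in a couple of lines. The memory figure assumes 6-bit registers, which is an assumption about the implementation and not something this slide states.

```python
import math

def hll_standard_error(precision_bits):
    m = 2 ** precision_bits                  # number of hash slots (registers)
    return 1.04 / math.sqrt(m)

error = hll_standard_error(14)
print(f"{error:.4%}")                        # 0.8125%

# Assumption: 6 bits per register.
register_bytes = (2 ** 14) * 6 // 8          # 12288 bytes, about 12 KB
```

Adding precision bits lowers the error but doubles the number of registers each time, which is the tunable tradeoff between size and error.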
HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
Top Gun: Hour 2

Top Gun: Hour 1
Uniform Distribution:
Estimate distinct # of users by 
inspecting just the beginning
Top Gun: Hour 1 + 2

Combine across 
different scales
Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets

b << m
Hash items multiple times

Similar items hash to overlapping buckets
Compare just contents of buckets

Much smaller cartesian … and parallel !!
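A sketch of the bucketing idea, assuming MinHash signatures with banding, which is one common LSH scheme for Jaccard similarity. The talk does not say which scheme its demo uses, and the names and parameters here are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, value):
    return int(hashlib.sha256(f"{seed}:{value}".encode("utf-8")).hexdigest(), 16)

def minhash_signature(items, num_hashes=12):
    # One minimum per salted hash; similar sets tend to share minima.
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def candidate_pairs(sets_by_id, num_hashes=12, bands=6):
    """Hash each non-empty set into buckets; only bucket-mates become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_id.items():
        sig = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))  # small cartesian per bucket
    return pairs

tags = {
    "a": {"navy", "pilot", "action"},
    "b": {"navy", "pilot", "action"},
    "c": {"animation", "musical", "family"},
}
pairs = candidate_pairs(tags)
```

Only the candidate pairs need an exact similarity check, and each bucket can be compared independently.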
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity

Twitter DIMSUM Case Study

40% efficiency gain over brute-force Cosine Sim
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
  Netflix Recommendations and Data Pipeline
Common Tools to Approximate
Twitter Algebird
Apache Spark
Composable Library
Distributed Cache
Big Data Processing
Twitter Algebird
Rooted in Algebraic Fundamentals!

Min, Max, Avg

BloomFilter (Set.contains(key))

HyperLogLog (Count Distinct)

CountMin Sketch (TopK Count)

Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)

Add user views for given movie

PFADD TopGun_HLL user1001 user2009 user3005

PFADD TopGun_HLL user3003 user1001

Get distinct count (cardinality) of set

PFCOUNT TopGun_HLL

Returns: 4 (distinct users viewed this movie; duplicates are ignored)

Union 2 HyperLogLog Data Structures
Spark Approximations
Spark Core

Spark SQL


approxCountDistinct(column), HyperLogLogPlus
Spark ML

Stratified sampling

PairRDD.sampleByKey(withReplacement: Boolean, fractions: Map[K, Double])

DIMSUM sampling

Probabilistic sampling reduces amount of comparison shuffle

Spark Streaming

A/B testing

Exact Count vs. Approx HyperLogLog, CountMin Sketch
HashSet vs. HyperLogLog (Memory)
HashSet vs. CountMin Sketch (Memory)
Set Similarity
Brute Force vs. Locality Sensitive Hashing Similarity
Brute Force Cartesian All Pair Similarity
47 seconds
Locality Sensitive Hash All Pair Similarity
6 seconds
Many More Demos!

Download Docker 
Clone Github
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
  Netflix Recommendations and Data Pipeline
Netflix Recommendation & Data Pipeline
From 5 Stars to Trending Now
Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.

Netflix can use this data to buy new movies.

Netflix is global.

Netflix can use this data to choose original programming.

Netflix knows that a lot of people like politics and Kevin Spacey.
The UK doesn’t have White Castle.
My favorite movie: “Harold and Kumar Go to White Castle”
Renamed my favorite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!
Summary: Buy NFLX Stock!
$1 Million Netflix Prize (2006-2009)

Improve movie predictions by 10% (RMSE)


(userId, movieId, rating, timestamp)

Test data withheld to calculate RMSE upon submission

Winning algorithm

10.06% improvement (RMSE)

Ensemble of 500+ ML models combined with GBDTs

Computationally impractical
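For reference, a small sketch of RMSE and of the relative improvement the contest measured. The ratings and RMSE values below are hypothetical, not the actual Netflix Prize figures.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and withheld ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def improvement(baseline_rmse, new_rmse):
    return (baseline_rmse - new_rmse) / baseline_rmse

held_out = [4, 4]                    # hypothetical withheld ratings
guesses = [2, 4]                     # hypothetical predictions
error = rmse(guesses, held_out)      # sqrt((4 + 0) / 2)
gain = improvement(1.0, 0.9)         # a 10% improvement on a made-up baseline
```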
Secrets to the Winning Algorithms
Adjust for the following human biases…

① Alice Effect: user rates lower than the average user
② Inception Effect: movie is rated higher than the average movie
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Mood, time of day, day of week, season, weather
⑥ Number of days since user’s first rating
⑦ Number of days since movie’s first rating
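A sketch of how the first three adjustments could be modeled, assuming a simple baseline of overall mean plus a per-movie bias plus a per-user bias. This is an illustration with made-up ratings, not the winning team's actual model, and it ignores the time-based effects in ⑤ to ⑦.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns (mu, user_bias, movie_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)      # overall mean rating
    by_movie = defaultdict(list)
    for _, m, r in ratings:
        by_movie[m].append(r - mu)
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}   # Inception Effect
    by_user = defaultdict(list)
    for u, m, r in ratings:
        by_user[u].append(r - mu - movie_bias[m])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}     # Alice Effect
    return mu, user_bias, movie_bias

def predict(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies fall back to a bias of zero.
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

ratings = [("alice", "Inception", 4), ("alice", "Taps", 2),
           ("bob", "Inception", 5), ("bob", "Taps", 3)]
mu, user_bias, movie_bias = fit_baseline(ratings)
guess = predict(mu, user_bias, movie_bias, "alice", "Inception")
```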
Netflix Data Pipeline - Then
Netflix Data Pipeline - Now
8 million events per second
Netflix Recommendation Pipeline
Throw away 
user factors (U)
Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
Netflix Trending Now
Time of day
Personalized to user (viewing history, past ratings)
Personalized to events (Valentine’s Day)
Number of 
Number of 
Take Rate
Bonus: Pandora Time of Day Recs
Work Days

Play familiar music

User is less likely to accept new music

Evenings and Weekends

Play new music

User is more likely to accept new music
Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend without needing viewing history
Helps with Cold Start problem
Netflix Search
No results? No problem… Show similar results!

Empty searches are good!

Explicit feedback for future recommendations

Content to buy and produce!

Bonus: Netflix in 2004
Netflix noticed people started to rate movies higher!?
Significant UI improvements made around that time
Recommendation improvements (Cinematch)

Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center
San Francisco, California, USA
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Power of data. Simplicity of design. Speed of innovation.
IBM Spark

More Related Content

What's hot

Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Chris Fregly
Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015
Chris Fregly
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015
Chris Fregly
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
Chris Fregly
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
Athens Big Data
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015
Chris Fregly
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015
Chris Fregly
Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015
Chris Fregly
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Chris Fregly
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Chris Fregly
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Vitthal Shirke
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam

Recently uploaded (20)

Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam

Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC Meetup Feb 22 2016

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark & Recommendations Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP Apache Spark Maryland Meetup Thanks to Tetra Concepts & Jailbreak Brewing Co!! Feb 22nd, 2016 Chris Fregly Principal Data Solutions Engineer We’re Hiring! (Only Nice People)!
  • 2. Who Am I? Streaming Data Engineer Netflix OSS Committer
 Data Solutions Engineer
 Apache Contributor Principal Data Solutions Engineer IBM Technology Center Meetup Organizer Advanced Apache Meetup Book Author Advanced . Due 2016
  • 3. Recent World Tour: Freg-a-Palooza! London Spark Meetup (Oct 12th), Scotland Data Science Meetup (Oct 13th), Dublin Spark Meetup (Oct 15th), Barcelona Spark Meetup (Oct 20th), Madrid Big Data Meetup (Oct 22nd), Paris Spark Meetup (Oct 26th), Amsterdam Spark Summit (Oct 27th), Brussels Spark Meetup (Oct 30th), Zurich Big Data Meetup (Nov 2nd), Geneva Spark Meetup (Nov 5th), Oslo Big Data Hadoop Meetup (Nov 19th), Helsinki Spark Meetup (Nov 20th), Stockholm Spark Meetup (Nov 23rd), Copenhagen Spark Meetup (Nov 25th), Istanbul Spark Meetup (Nov 26th), Budapest Spark Meetup (Nov 28th), Singapore Spark Meetup (Dec 1st), Sydney Spark Meetup (Dec 8th), Melbourne Spark Meetup (Dec 9th), Toronto Spark Meetup (Dec 14th)
  • 4. Advanced Apache Spark Meetup. Meetup Metrics: Top 5 Most-active Spark Meetup! 2600 Members in just 6 mos!! 2600 Docker downloads (demos). Meetup Mission: Deep-dive into Spark and related open source projects. Surface key patterns and idioms. Focus on distributed systems, scale, and performance.
  • 5. Live, Interactive Demo!! Audience Participation Required (cell phone or laptop)
  • 6. End User -> ElasticSearch -> Spark ML -> Data Scientist -> <- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython
  • 7. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools; Netflix Recommendations and Data Pipeline
  • 8. Scaling with Parallelism [figure labels: Peter, O(log n)]
  • 9. Scaling with Composability. Max: (a max b max c max d) == (a max b) max (c max d). Set Union: (a U b U c U d) == (a U b) U (c U d). Addition: (a + b + c + d) == (a + b) + (c + d). Multiply: (a * b * c * d) == (a * b) * (c * d). Division??
  • 10. What about Division? Division: (a / b / c / d) != (a / b) / (c / d). Example: (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8), because (((3 / 4) / 7) / 8) == 3 / 224 ≈ 0.0134 while ((3 * 8) / (4 * 7)) == 24 / 28 ≈ 0.857. Not Composable. What were the Egyptians thinking?! “Divide like an Egyptian”
  • 11. What about Average? Each element is a [value, count] pair: [3, 1], [5, 1], [5, 1], [7, 1]. Overall AVG: ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5, so AVG (3, 5, 5, 7) == 5. Add, Add, Add? Composable! Single Divide at the End? Doesn’t need to be Composable! Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5. Divide, Add, Divide? Not Composable
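
The average slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the helper names `to_pair` and `combine` are made up here. Each value is carried as a (sum, count) pair, pairs are merged with plain addition (which is associative, so partitions can be merged in any grouping), and the single divide happens only at the end.

```python
from functools import reduce

def to_pair(value):
    """Lift a raw value into a (sum, count) pair."""
    return (value, 1)

def combine(a, b):
    """Merge two (sum, count) pairs; addition keeps this associative."""
    return (a[0] + b[0], a[1] + b[1])

values = [3, 5, 5, 7]

# Simulate two partitions that are aggregated independently, then merged.
left = reduce(combine, map(to_pair, values[:2]))    # (8, 2)
right = reduce(combine, map(to_pair, values[2:]))   # (12, 2)
merged = combine(left, right)                       # (20, 4)

composable_avg = merged[0] / merged[1]              # single divide at the end
pairwise_avg = (3 + 5) / 2 + (5 + 7) / 2            # divide, add, divide

print(composable_avg, pairwise_avg)
```

The first result matches the overall average of 5 from the slide; the second reproduces the incorrect pairwise value of 10.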
  • 12. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools; Netflix Recommendations and Data Pipeline
  • 13. Similarity
  • 14. Euclidean Similarity: Exists in Euclidean, flat space. Based on Euclidean distance. Linear measure. Bias towards magnitude.
  • 15. Cosine Similarity: Angular measure. Adjusts for Euclidean magnitude bias. Normalizes to unit vectors.
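
To make the magnitude bias concrete, here is a small Python sketch. The two viewer vectors are invented for illustration: they point in the same direction but differ tenfold in magnitude, so Euclidean distance calls them far apart while cosine similarity treats them as identical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-genre view counts for a light and a heavy viewer.
light_viewer = [1.0, 2.0, 0.0]
heavy_viewer = [10.0, 20.0, 0.0]

distance = euclidean_distance(light_viewer, heavy_viewer)
similarity = cosine_similarity(light_viewer, heavy_viewer)
print(distance, similarity)
```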
  • 16. Jaccard Similarity: Set similarity measurement. Set intersection / set union -> Based on Jaccard distance. Bias towards popularity.
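
A minimal Python sketch of the intersection-over-union calculation follows. The tag sets are invented examples, and treating two empty sets as fully similar is a convention chosen here, not something the slide specifies.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical tag sets for two movies.
movie_a_tags = {"navy", "pilot", "1980s"}
movie_b_tags = {"pilot", "1980s", "courtroom"}

tag_similarity = jaccard_similarity(movie_a_tags, movie_b_tags)  # 2 shared / 4 total
print(tag_similarity)
```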
  • 17. Log Likelihood Similarity: Adjusts for popularity bias. Netflix “Shawshank” problem.
  • 18. Word Similarity: Based on edit distance. Calculate char differences between words. Deletes, transposes, replaces, inserts.
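
One way to count deletes, transposes, replaces, and inserts is the optimal string alignment variant of Damerau-Levenshtein distance. The sketch below assumes that variant; the slide does not say which edit-distance definition was used in the talk.

```python
def edit_distance(s, t):
    """Optimal string alignment distance: insert, delete, replace,
    and transpose adjacent characters, each at cost 1."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            # Transpose two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(edit_distance("spark", "spakr"))     # one transposition
print(edit_distance("kitten", "sitting"))  # two replaces and one insert
```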
  • 19. Document Similarity. TF/IDF: Term Freq / Inverse Document Freq; used by most search engines. Word2Vec: Words embedded in a vector space, with similar words nearby.
  • 20. Similarity Pathway: ie. Closest recommendations between 2 people
  • 21. Calculating Similarity. Exact: Brute-Force “All-pairs similarity” (aka “Pair-wise similarity”, “Similarity join”); Cartesian O(n^2) shuffle and comparison. Approximate: Sampling; Bucketing (aka “Partitioning”, “Clustering”); Remove data with low probability of similarity; Reduce shuffle and comparisons.
  • 22. Bonus: Document Summary. Text Rank (aka “Sentence Rank”): TF/IDF + Similarity Graph + PageRank. Intuition: Surface summary sentences (abstract); Most similar to all others (TF/IDF + Similarity Graph); Most influential sentences (PageRank).
  • 23. Similarity Graph: Vertex is movie, tag, actor, plot summary, etc. Edges are relationships and weights.
  • 24. Topic-Sensitive PageRank: Graph diffusion algorithm. Pre-process graph, add vector of probabilities to each vertex. Probability of landing at this vertex from every other vertex.
  • 25. Recommendations
  • 26. Basic Terminology. User: User seeking recommendations. Item: Item being recommended. Explicit User Feedback: like, rating, movie view, profile read, search. Implicit User Feedback: click, hover, scroll, navigation. Instances: Rows of user feedback/input data. Overfitting: Training a model too closely to the training data & hyperparameters. Hold Out Split: Holding out some of the instances to avoid overfitting. Features: Columns of instance rows (of feedback/input data). Cold Start Problem: Not enough data to personalize (new). Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations). Model Evaluation: Compare predictions to actual values of hold out split. Feature Engineering: Modify, reduce, combine features.
  • 27. Features. Binary: True or False. Numeric Discrete: Integers. Numeric: Real Values. Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon). Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5). Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots. Temporal: Time-based, Time of Day, Binge Viewing. Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming). Media: Images, Audio, Video. Geographic: (Longitude, Latitude), Geohash. Latent: Hidden Features within Data (Collaborative Filtering). Derived: Age of Movie, Duration of User Subscription.
  • 28. Feature Engineering. Dimension Reduction: Reduce number of features in feature space. Principal Component Analysis (PCA): Help find principal features that best describe variance in data; Peel the dimensional layers back until you describe the data. One-Hot Encoding: Convert nominal categorical feature values to 0’s, 1’s; Remove numerical relationship between the categories; 1 binary column per category. Bears -> 1 --> [1,0,0]; 49’ers -> 2 --> [0,1,0]; Steelers -> 3 --> [0,0,1]
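
The one-hot step can be sketched as follows. This is an illustrative helper, not the Spark ML encoder; it assigns columns in first-seen order so the output lines up with the Bears / 49’ers / Steelers example on the slide.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per category.
    Columns are assigned in first-seen order."""
    categories = list(dict.fromkeys(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, encoded = one_hot_encode(teams)
for team, row in zip(teams, encoded):
    print(team, row)
```

Because every category gets its own column, the encoded rows no longer imply that Steelers (3) is "greater than" Bears (1).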
  • 29. Normalize and Standardize Features. Goal: Scale features to standard size; Required by many ML algos. Normalize Features: Calculate L1 (or L2, etc) norm; Divide elements by norm. Standardize Features: Apply standard normal transformation; Mean == 0; StdDev == 1.
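
A minimal sketch of both transformations, assuming the L2 norm for normalization and the population standard deviation for standardization (the slide allows other norms and does not say which deviation is meant). The sample numbers are invented.

```python
import math

def l2_normalize(vector):
    """Divide each element by the vector's L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    """Shift to mean 0 and scale to standard deviation 1
    (population standard deviation, an assumption made here)."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column] if std else [0.0] * len(column)

unit_vector = l2_normalize([3.0, 4.0])
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(unit_vector)
print(z_scores)
```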
  • 30. Non-Personalized Recommendations
  • 31. Cold Start Problem. “Cold Start” problem: New user, don’t know their preference, must show something! Movies with highest-rated actors: Top K Aggregations. Most desirable singles: PageRank of likes and dislikes. Facebook social graph: Friend-based recommendations.
  • 32. Personalized Recommendations
  • 33. Clustering (aka. Nearest Neighbors). User-to-User Clustering (User Behavior): Similar items viewed or rated; Similar viewing pattern (ie. binge or casual). Item-to-Item Clustering (Item Description): Similar item tags/metadata (Jaccard Similarity, Locality Sensitive Hash); Similar profile text and categories (TF/IDF, Word2Vec, NLP); Similar images/facial structures (Convolutional Neural Nets, Eigenfaces). [Figure labels: Dating Site -> OKCupid Profile, My Hinge Profile]
  • 34. Bonus: NLP Conversation Bot. “If your responses to my generic opening lines are positive, I may read your profile.” Spark ML and Stanford CoreNLP: TF/IDF, DecisionTrees, Sentiment Analysis
  • 35. User-to-Item Collaborative Filtering. Matrix Factorization: ① Factor the large matrix (left) into 2 smaller matrices (right) ② Smaller matrices, when multiplied, approximate original ③ Fill in the missing values in the large matrix ④ Surface latent features from within user-item interaction
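
The four matrix factorization steps can be sketched with plain stochastic gradient descent. This is a toy under stated assumptions: Spark ML's recommender uses alternating least squares rather than SGD, and the ratings, rank, learning rate, and regularization below are all invented for illustration.

```python
import random

def factorize(ratings, num_users, num_items, rank=2, epochs=2000,
              learning_rate=0.01, regularization=0.02, seed=42):
    """Learn user and item factor matrices from (user, item, rating) triples."""
    rng = random.Random(seed)
    user_factors = [[rng.random() for _ in range(rank)] for _ in range(num_users)]
    item_factors = [[rng.random() for _ in range(rank)] for _ in range(num_items)]
    for _ in range(epochs):
        for user, item, rating in ratings:
            p, q = user_factors[user], item_factors[item]
            error = rating - sum(pf * qf for pf, qf in zip(p, q))
            for f in range(rank):
                p_f, q_f = p[f], q[f]
                p[f] += learning_rate * (error * q_f - regularization * p_f)
                q[f] += learning_rate * (error * p_f - regularization * q_f)
    return user_factors, item_factors

def predict(user_factors, item_factors, user, item):
    """Multiply the two small matrices back together for one cell."""
    return sum(pf * qf for pf, qf in zip(user_factors[user], item_factors[item]))

def rmse(ratings, user_factors, item_factors):
    squared = [(r - predict(user_factors, item_factors, u, i)) ** 2
               for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

# Hypothetical sparse ratings: (user index, item index, rating). Missing cells are absent.
observed = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1),
            (2, 0, 1), (2, 1, 1), (2, 3, 5), (3, 0, 1), (3, 3, 4),
            (4, 1, 1), (4, 2, 5), (4, 3, 4)]

users, items = factorize(observed, num_users=5, num_items=4)
fit_error = rmse(observed, users, items)
filled_in = predict(users, items, 0, 2)  # a cell with no observed rating
print(fit_error, filled_in)
```

The learned factor columns play the role of the latent features in step ④; they carry no labels and are only meaningful relative to each other.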
  • 36. Item-to-Item Collaborative Filtering. Made famous by Amazon Paper ~2003. Problem: As # of users grew, user-item collab filtering didn’t scale. Solution: Offline/Batch: Generate itemId -> List[userId] vectors. Online/Real-time: For each item in cart, recommend similar items from vector space.
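
The offline/online split can be sketched as below. The interaction data is made up, and Jaccard similarity over the userId sets is an assumption chosen for brevity; the slide does not name the similarity measure used for the item vectors.

```python
from collections import defaultdict

def build_item_vectors(interactions):
    """Offline/batch step: itemId -> set of userIds who interacted with it."""
    item_users = defaultdict(set)
    for user_id, item_id in interactions:
        item_users[item_id].add(user_id)
    return dict(item_users)

def similar_items(item_id, item_users, top_k=2):
    """Online step: rank other items by overlap of their user sets."""
    target = item_users[item_id]
    scored = []
    for other, users in item_users.items():
        if other == item_id:
            continue
        union = target | users
        score = len(target & users) / len(union) if union else 0.0
        scored.append((score, other))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:top_k] if score > 0]

# Hypothetical (userId, itemId) view events.
interactions = [
    ("u1", "Top Gun"), ("u2", "Top Gun"), ("u3", "Top Gun"),
    ("u1", "A Few Good Men"), ("u2", "A Few Good Men"),
    ("u3", "Taps"),
    ("u4", "Harold and Kumar"),
]

item_vectors = build_item_vectors(interactions)
recommendations = similar_items("Top Gun", item_vectors)
print(recommendations)
```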
  • 37. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools; Netflix Recommendations and Data Pipeline
  • 38. When to Approximate? Memory or time constrained queries: Relative vs. exact counts are OK (# errors between then and now). Using machine learning or graph algos: Inherently probabilistic and approximate; Finding topics in documents (LDA); Finding similar pairs of users, items, words at scale (LSH); Finding top influencers (PageRank). Streaming aggregations: Inherently sloppy collection (exactly once?). Approximate as much as you can get away with! Ask for forgiveness later !!
  • 39. When NOT to Approximate? If you’ve ever heard the term… “Sarbanes-Oxley” …at the office after 2002.
  • 40. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools; Netflix Recommendations and Data Pipeline
  • 41. A Few Good Algorithms: You can’t handle the approximate!
  • 42. Common to These Algos & Data Structs. Low, fixed size in memory. Known error bounds. Store large amount of data. Less memory than Java/Scala collections. Tunable tradeoff between size and error. Rely on multiple hash functions or operations. Size of hash range defines error.
  • 43. Bloom Filter: Set.contains(key): Boolean. “Hash Multiple Times and Flip the Bits Wherever You Land”
  • 44. Bloom Filter. Approximate set membership for key. False positive: expect contains(), actual !contains(). True negative: expect !contains(), actual !contains(). Elements are only added, never removed.
  • 45. Bloom Filter in Action: set(key); contains(key): Boolean. TRUE -> maybe contains. FALSE -> definitely does not contain. Images by @avibryant
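
“Hash multiple times and flip the bits wherever you land” can be sketched like this. It is a toy, not the Algebird implementation: the bit-array size, the number of hashes, and the use of seeded SHA-256 as the hash family are all assumptions.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per hash function; the seed selects the function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # True -> maybe contains; False -> definitely does not contain.
        return all(self.bits[position] for position in self._positions(key))

viewers = BloomFilter()
empty_lookup = viewers.might_contain("user1001")  # nothing added yet
viewers.add("user1001")
viewers.add("user2009")
print(empty_lookup, viewers.might_contain("user1001"))
```

There is no `remove` method: clearing a bit could erase evidence of another key that hashed to the same position, which is why elements are only added.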
  • 46. CountMin Sketch: Frequency Count and TopK. “Hash Multiple Times and Add 1 Wherever You Land”
  • 47. CountMin Sketch (CMS). Approximate frequency count and TopK for key, ie. “Heavy Hitters” on Twitter: Matei Zaharia, Martin Odersky, Donald Trump
  • 48. CountMin Sketch In Action (TopK, Count). Use Case: TopK movies using total views. add(Top Gun, 2); add(A Few Good Men, 1); add(Taps, 1); getCount(Top Gun): Long. Multiple hash functions (1 hash function per row). Binary hash output (1 element per column). x 2 occurrences of “Top Gun” for slightly additional complexity. Find minimum of all rows. Can overestimate, but never underestimate. [Figure: sketch cells for Top Gun (x 2), A Few Good Men, and Taps, with cells where Top Gun and A Few Good Men overlap.] Images derived from @avibryant
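
The add / getCount walk-through above can be sketched as a small table of counters. As with the Bloom filter, this is an illustrative toy rather than the Algebird class: the depth, width, and seeded SHA-256 hashing are assumptions.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts: may overestimate, never underestimates."""

    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest available estimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))
```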
  • 49. HyperLogLog: Count Distinct. “Hash Multiple Times and Uniformly Distribute Where You Land”
  • 50. HyperLogLog (HLL). Approximate count distinct. Slight twist: Special hash function creates uniform distribution. Error estimate: 14 bits for size of range; m = 2^14 = 16,384 hash slots; error = 1.04/(sqrt(16,384)) = .81% [Figure callout: Not many of these]
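
The error estimate on this slide is simple arithmetic and can be checked directly. The helper name `hll_standard_error` is made up for this sketch; the formula 1.04 / sqrt(m) is the one quoted above.

```python
import math

def hll_standard_error(index_bits):
    """Relative standard error for a HyperLogLog with 2**index_bits slots."""
    slots = 2 ** index_bits
    return 1.04 / math.sqrt(slots)

error_14_bits = hll_standard_error(14)  # 1.04 / 128
print(f"{error_14_bits:.4%}")

# Fewer bits means fewer slots, less memory, and a larger error.
for bits in (10, 12, 14, 16):
    print(bits, 2 ** bits, f"{hll_standard_error(bits):.2%}")
```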
  • 51. HyperLogLog In Action (Count Distinct). Use Case: Number of distinct users who view a movie. Uniform Distribution: Estimate distinct # of users by inspecting just the beginning. Combine across different scales. [Figure: hashed user IDs for Top Gun in Hour 1, Hour 2, and Hour 1 + 2, spread uniformly over the hash range.]
  • 52. Locality Sensitive Hashing: Set Similarity. “Pre-process Items into Buckets, Compare Within Buckets”
  • 53. Locality Sensitive Hashing (LSH). Approximate set similarity. Hash designed to cluster similar items. Avoids cartesian all-pairs comparison. Pre-process m rows into b buckets (b << m). Hash items multiple times. Similar items hash to overlapping buckets. Compare just contents of buckets: Much smaller cartesian … and parallel !!
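
One common way to build such buckets for set similarity is MinHash signatures split into bands. The sketch below assumes that scheme (the slide does not name a specific LSH family), uses seeded SHA-256 as the hash functions, requires non-empty sets, and runs on invented profile data.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(items, num_hashes=12):
    """For each hash function, keep the smallest hash over the (non-empty) set."""
    return [
        min(int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items)
        for seed in range(num_hashes)
    ]

def lsh_candidate_pairs(named_sets, num_hashes=12, bands=4):
    """Bucket each signature band; only names sharing a bucket get compared."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = tuple(signature[band * rows_per_band:(band + 1) * rows_per_band])
            buckets[(band, chunk)].add(name)
    candidates = set()
    for names in buckets.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates

# Hypothetical profile tag sets.
profiles = {
    "alice": {"hiking", "spark", "coffee", "jazz"},
    "bob": {"hiking", "spark", "coffee", "jazz"},
    "carol": {"opera", "chess", "sailing", "tea"},
}

candidates = lsh_candidate_pairs(profiles)
print(candidates)
```

Only the candidate pairs need an exact similarity check afterwards, which is where the saving over the full cartesian comparison comes from.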
  • 54. DIMSUM: Set Similarity. “Pre-process and ignore data that is unlikely to be similar.”
  • 55. DIMSUM: “Dimension Independent Matrix Square Using MR”. Remove vectors with low probability of similarity. RowMatrix.columnSimilarities(threshold). Twitter DIMSUM Case Study: 40% efficiency gain over brute-force Cosine Sim.
  • 56. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools; Netflix Recommendations and Data Pipeline
  • 57. Common Tools to Approximate. Twitter Algebird: Composable Library. Redis: Distributed Cache. Apache Spark: Big Data Processing.
  • 58. Twitter Algebird. Rooted in Algebraic Fundamentals! Parallel, Associative, Composable. Examples: Min, Max, Avg; BloomFilter (Set.contains(key)); HyperLogLog (Count Distinct); CountMin Sketch (TopK Count).
  • 59. Redis Implementation of HyperLogLog (Count Distinct). 12KB per item count; 2^64 max # of items; 0.81% error (Tunable). Add user views for given movie (ignore duplicates): PFADD TopGun_HLL user1001 user2009 user3005; PFADD TopGun_HLL user3003 user1001. Get distinct count (cardinality) of set: PFCOUNT TopGun_HLL. Returns: 4 (distinct users viewed this movie). Union 2 HyperLogLog Data Structures: PFMERGE TopGun_HLL Taps_HLL
  • 60. Spark Approximations. Spark Core: RDD.count*Approx(). Spark SQL: PartialResult approxCountDistinct(column), HyperLogLogPlus. Spark ML: Stratified sampling: PairRDD.sampleByKey(fractions: Double[ ]). DIMSUM sampling: Probabilistic sampling reduces amount of comparison shuffle; RowMatrix.columnSimilarities(threshold). Spark Streaming: A/B testing: StreamingTest.setTestMethod(“welch”).registerStream(dstream)
  • 61. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Demos! 61
  • 62. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Counting Exact Count vs. Approx HyperLogLog, CountMin Sketch 62
  • 63. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark HashSet vs. HyperLogLog (Memory) 63
  • 64. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark HashSet vs. CountMin Sketch (Memory) 64
  • 65. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Set Similarity Bruce Force vs. Locality Sensitive Hashing Similarity 65
  • 66. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Brute Force Cartesian All Pair Similarity 66 47 seconds
  • 67. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Locality Sensitive Hash All Pair Similarity 67 6 seconds
  • 68. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Many More Demos! or Download Docker Clone Github 68
  • 69. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline   Scaling with Parallelism and Composability   Similarity and Recommendations   When to Approximate   Common Algorithms and Data Structures   Common Libraries and Tools   Netflix Recommendations and Data Pipeline 69
  • 70. Netflix Recommendation & Data Pipeline: From 5 Stars to Trending Now
  • 71. Netflix Has a Lot of Data. Netflix has a lot of data about a lot of users and a lot of movies. Netflix can use this data to buy new movies. Netflix is global. Netflix can use this data to choose original programming. Netflix knows that a lot of people like politics and Kevin Spacey. Callouts: My favorite movie: “Harold and Kumar Go to White Castle”. The UK doesn’t have White Castle. Renamed my favourite movie to: “Harold and Kumar Get the Munchies”. This broke my unit tests! Summary: Buy NFLX Stock!
  • 72. $1 Million Netflix Prize (2006-2009). Goal: Improve movie predictions by 10% (RMSE). Dataset: (userId, movieId, rating, timestamp); test data withheld to calculate RMSE upon submission. Winning algorithm: 10.06% improvement (RMSE); ensemble of 500+ ML models combined with GBDTs; computationally impractical
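The RMSE metric and the percentage improvement used to judge the prize can be written down in a few lines. This is a minimal sketch; the function names `rmse` and `improvement_pct` and the sample numbers are illustrative and are not the actual prize scores.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement_pct(baseline_rmse, new_rmse):
    """Relative RMSE reduction versus a baseline, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Illustrative numbers only: a drop from 1.0 to 0.9 is a 10% improvement.
example_error = rmse([3.0, 3.0], [4.0, 2.0])
example_gain = improvement_pct(1.0, 0.9)
```

Because the errors are squared before averaging, a few badly wrong predictions hurt the score more than many slightly wrong ones.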
  • 73. Secrets to the Winning Algorithms. Adjust for the following human biases… ① Alice Effect: user rates lower than the average user ② Inception Effect: movie is rated higher than the average movie ③ Overall mean rating of a movie ④ Number of people who have rated a movie ⑤ Mood, time of day, day of week, season, weather ⑥ Number of days since user’s first rating ⑦ Number of days since movie’s first rating
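The first two adjustments (a user who rates low, a movie that is rated high) are commonly modeled as a baseline predictor: global mean plus a per-user offset plus a per-movie offset. The sketch below assumes that formulation and uses plain unregularized averages; the function names and the toy ratings are hypothetical, and the prize-winning models were considerably more elaborate.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns (mean, user_bias, movie_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    user_dev = defaultdict(list)
    movie_dev = defaultdict(list)
    for user, movie, rating in ratings:
        user_dev[user].append(rating - mu)
        movie_dev[movie].append(rating - mu)
    # Average deviation from the global mean (no regularization in this sketch).
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    movie_bias = {m: sum(d) / len(d) for m, d in movie_dev.items()}
    return mu, user_bias, movie_bias


def predict(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies contribute no offset; clamp to the 1-5 star scale.
    raw = mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(5.0, max(1.0, raw))


ratings = [
    ("alice", "inception", 4), ("alice", "other", 2),
    ("bob", "inception", 5), ("bob", "other", 3),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
```

Here `alice` ends up with a negative offset and `inception` with a positive one, so the model separates "harsh rater" from "well-liked movie" before any collaborative filtering is applied.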
  • 74. Netflix Data Pipeline - Then: v1.0! v2.0!
  • 75. Netflix Data Pipeline - Now: v3.0! 8 million events per second
  • 76. Netflix Recommendation Pipeline: Throw away batch-generated user factors (U)
  • 77. Netflix Common ML Algorithms: Logistic Regression, Linear Regression, Gradient Boosted Decision Trees, Random Forest, Matrix Factorization, SVD, Restricted Boltzmann Machines, Deep Neural Nets, Markov Models, LDA, Clustering. Ensembles
  • 78. Netflix Trending Now: Time of day; Personalized to user (viewing history, past ratings); Personalized to events (Valentine’s Day). “VHS”: Number of Plays, Number of Impressions, Calculate Take Rate
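The take-rate calculation can be sketched as plays divided by impressions. That definition is an assumption inferred from the slide's wording, and the function names, titles, and numbers below are hypothetical.

```python
def take_rate(plays, impressions):
    """Fraction of impressions that led to a play (assumed definition)."""
    return plays / impressions if impressions else 0.0


def trending(stats, top_n=3):
    """stats: {title: (plays, impressions)}. Titles ranked by take rate, highest first."""
    ranked = sorted(stats, key=lambda title: take_rate(*stats[title]), reverse=True)
    return ranked[:top_n]


stats = {
    "Title A": (50, 1000),   # 5% take rate
    "Title B": (30, 100),    # 30% take rate
    "Title C": (0, 0),       # never shown, treated as 0
}
top_titles = trending(stats, top_n=2)
```

Ranking by rate rather than raw plays keeps heavily promoted titles from dominating simply because they were shown more often.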
  • 79. Bonus: Pandora Time of Day Recs. Work Days: Play familiar music; user is less likely to accept new music. Evenings and Weekends: Play new music; user is more likely to accept new music
  • 80. Netflix Social Integration: Post to Facebook after movie start (5 mins); Recommend without needing viewing history; Helps with Cold Start problem
  • 81. Netflix Search: No results? No problem… Show similar results! Empty searches are good! Explicit feedback for future recommendations. Content to buy and produce!
  • 82. Bonus: Netflix in 2004. Netflix noticed people started to rate movies higher!? Why? Significant UI improvements made around that time; Recommendation improvements (Cinematch)
  • 83. Thank You!! Chris Fregly, @cfregly, IBM Spark Tech Center, San Francisco, California, USA. Sign up for the Meetup and Book. Contribute to Github Repo. Run all Demos using Docker. Find me: LinkedIn, Twitter, Github, Email, Fax. Image derived from
  • 84. @cfregly