Spark After Dark: Real-time Advanced Analytics and Machine Learning with Spark

Chris Fregly
AI and Machine Learning @ AWS, O'Reilly Author @ Data Science on AWS, Founder @ PipelineAI, formerly Databricks, Netflix
Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
chris@fregly.com
Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer
advancedspark.com
Why After Dark?
Playboy After Dark
Late 1960s TV Show
Progressive Show For Its Time
And it rhymes!!
What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
…
BlinkDB: approx queries
Spark in Production
What is Databricks?
Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring
Goals of After Dark?
①Generate high-quality recommendations
②Demonstrate Spark high-level libraries:
   Spark Streaming -> Kafka, Approximations
   Spark SQL -> DataFrames, Cassandra
   GraphX -> PageRank, Shortest Path
   MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com. Not affiliated with Tinder in any way!
Popular Dating Sites
Focus of This Talk
Spark and…
①Parallelism
②Performance
③Real-time Streaming
④Approximations
⑤Similarity Measures
Parallelism
Brady Bunch circa 1974
Season 5, Episode 18: “Two Petes in a Pod”
Parallel Algorithm: O(log n)
Non-parallel Algorithm: O(n)
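As a rough illustration of the O(log n) versus O(n) contrast on the two slides above, the sketch below sums a list two ways and counts the dependent steps. The function names are made up for this example, and the "parallel" version only simulates rounds whose additions could run concurrently; it is not Spark code.

```python
def sequential_sum(values):
    """Fold left to right: every addition waits for the previous one."""
    total, steps = 0, 0
    for v in values:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise (tree) reduction: additions inside one round are independent,
    so with enough workers each round costs a single time step."""
    level, rounds = list(values), 0
    while len(level) > 1:
        # Combine neighbours; an odd element at the end is carried over as-is.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return (level[0] if level else 0), rounds


data = list(range(1, 1025))      # 1024 values
print(sequential_sum(data))      # 1024 dependent steps
print(tree_sum(data))            # 10 rounds, because 2**10 == 1024
```

Both functions return the same total; only the length of the dependency chain differs.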
Spark is Parallel
Performance
Daytona Gray Sort Contest
On-disk only
250,000 partitions
No in-memory caching
[Table column headers: (2014), (2013), (2014)]
Improved Shuffle and Network Layer
①“Sort-based shuffle”
②Minimize OS resources
③Switched to async Netty
④Keep CPUs hot
⑤Reuse byte buffers to minimize GC
⑥Use epoll for I/O to stay in kernel space
Project Tungsten: CPU and Memory
①More JVM bytecode generation, JIT optimize
②CPU-cache-aware data structs and algos
③Custom memory management
[Diagram labels: Serializers, HashMap]
DataFrames and Catalyst
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!!
--> JVM bytecode generation
Columnar Storage Format
*Skip whole chunks with min-max heuristics
stored in each chunk (sorted data only)
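A minimal sketch of the min-max skipping idea above, in plain Python rather than any real columnar reader. The chunk layout (a dict with "min", "max", and "values") and the function names are assumptions made for illustration only.

```python
def make_chunks(values, chunk_size):
    """Split a column into chunks and store min/max statistics with each one."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        vals = values[i:i + chunk_size]
        chunks.append({"min": min(vals), "max": max(vals), "values": vals})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks actually had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # the statistics prove no value here can match: skip it
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


ages = list(range(18, 98))              # 80 sorted values
chunks = make_chunks(ages, 10)          # 8 chunks: 18-27, 28-37, ..., 88-97
matches, chunks_read = scan_greater_than(chunks, 85)
print(chunks_read, len(chunks))         # 2 of 8 chunks read
```

On unsorted data every chunk tends to span the full value range, so few chunks can be skipped; that is the reason for the "sorted data only" note.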
Parquet File Format
①Based on Google Dremel Paper
②Implemented by Twitter and Cloudera
③Columnar storage format
④Optimized for fast columnar aggregations
⑤Tight compression
⑥Supports pushdowns
⑦Nested, self-describing, evolving schema
Types of Compression
①Run Length Encoding: repeated data
②Dictionary Encoding: fixed set of values
③Delta, Prefix Encoding: sorted dataset
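The three encodings above can be sketched in a few lines of plain Python. These are toy versions meant to show the idea, not the actual Parquet encoders; the function names and sample data are assumptions for this example.

```python
def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [tuple(pair) for pair in encoded]


def dictionary_encode(values):
    """Replace each value from a small fixed set with a compact integer code."""
    dictionary, codes = {}, []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes


def delta_encode(sorted_values):
    """Store the first value, then only the difference to the previous value."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


print(run_length_encode(["F", "F", "F", "M", "M"]))   # [('F', 3), ('M', 2)]
print(dictionary_encode(["NY", "SF", "NY"]))          # ({'NY': 0, 'SF': 1}, [0, 1, 0])
print(delta_encode([1000, 1003, 1004, 1010]))         # [1000, 3, 1, 6]
```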
Types of Pushdowns
①Column, Partition Pruning
②Row, Predicate Filtering
Real-time Streaming
Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At-least-once delivery guarantee
Approximations
Count Min Sketch
① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
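For readers who have not seen the data structure, here is a small plain-Python Count-Min Sketch. It is not the Algebird implementation; the class name, the default width and depth, and the SHA-256 based hashing are choices made for this sketch.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters: depth rows, width columns per row."""

    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, derived by salting with the row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for user in ["alice"] * 100 + ["bob"] * 3 + ["carol"]:
    sketch.add(user)
print(sketch.estimate("alice"))   # at least 100; never an undercount
```

Memory stays at width * depth counters no matter how many distinct items are added, which is the "low, fixed memory" property; the price is that estimates can overshoot.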
HyperLogLog
① Measures set cardinality (approx count distinct)
② Low memory: 1.5KB @ 2% error for 10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
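The following is a compact HyperLogLog sketch in plain Python, to make the idea behind approximate distinct counts concrete. It is neither Algebird's nor Spark's implementation: the class name, the SHA-256 hashing, and the default of 2^12 registers are assumptions for this example, and a Python list is used for clarity rather than for compact memory. The standard error of the estimator is roughly 1.04/sqrt(m) for m registers.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, m >= 128

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
print(round(hll.count()))   # close to 50000, typically within a few percent
```

Adding the same item again never changes a register, so duplicates do not affect the estimate.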
10 Recommendations
Types of Recommendations
①Non-personalized (2 out of 10)
   Cold Start: no preference or behavior data for user, yet
②Personalized (8 out of 10)
   User-Item Similarity: items that others with similar prefs have liked
   Item-Item Similarity
Interactive Demo!
Audience Participation Needed!
①Navigate to sparkafterdark.com
②Click 3 actors and 3 actresses
Non-personalized Recommendations
Summary Statistics and Aggregations
①Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
Spark SQL + DataFrames: Aggregations
Like Graph Analysis
②Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
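The random-walk intuition in the quote can be sketched with a small power-iteration PageRank in plain Python. This is not GraphX; the function name, the damping factor of 0.85, the iteration count, and the four made-up users are all assumptions for illustration.

```python
def pagerank(likes, damping=0.85, iterations=50):
    """likes maps each user to the list of users they liked (directed edges)."""
    users = set(likes) | {v for targets in likes.values() for v in targets}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Random jump: with probability 1 - damping, land on any user.
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = likes.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # A user who liked nobody spreads their rank evenly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical like graph: three users like u4, and u4 likes u1.
likes = {"u1": ["u4"], "u2": ["u4"], "u3": ["u4"], "u4": ["u1"]}
ranks = pagerank(likes)
print(max(ranks, key=ranks.get))   # u4
```

The ranks form a probability distribution over users, which matches the "probability of me liking them while walking the like graph" reading.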
Demo!
Spark SQL + DataFrames + GraphX
Similarity Measures
Types of Similarity
①Euclidean: linear measure; magnitude bias
②Cosine: angle measure; adjusts for magnitude bias
③Jaccard: set intersection divided by union; popularity bias
④Log Likelihood: adjusts for popularity bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
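To make the first three measures concrete, here is a plain-Python sketch. The vectors are hypothetical like-vectors invented for this example (they are not the rows of the table above, whose column positions did not survive extraction), and log likelihood is left out.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of the liked positions."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


u = [1, 1, 0, 1, 1]
v = [1, 1, 0, 0, 0]
w = [2, 2, 0, 0, 0]   # same direction as v, twice the magnitude

print(euclidean_distance(v, w))   # about 1.414: magnitude makes v and w look different
print(cosine_similarity(v, w))    # about 1.0: the angle between them is zero
print(jaccard_similarity(u, v))   # 0.5: two shared likes out of four in the union
```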
All-pairs Similarity Measure
①Compare everything to everything
②aka “pair-wise similarity” or “similarity join”
③Naïve shuffle: O(m*n^2); m=rows, n=cols
④Minimize shuffle: reduce data size & approximate
   Reduce m (rows): sampling and bucketing
   Reduce n (cols): remove most frequent value (0?)
Sampling Algo: DIMSUM
①“Dimension Independent Matrix Square using MapReduce”
②Remove rows with low similarity probability
③MLlib: RowMatrix.columnSimilarities(…)
④Twitter: 40% efficiency gain over Cosine
Bucket Algo: Locality Sensitive Hashing
① Split into b buckets using similarity hash algo
   Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*(n/b)*b^2); m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix: O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
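One common way to realize the bucketing step is MinHash with banding, sketched below in plain Python. This is not the spark-hash library's API; the function names, the SHA-256 based hash family, and the 10 bands of 2 rows are assumptions chosen for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes):
    """One minimum per seeded hash function. Sets with high Jaccard similarity
    tend to agree on many positions. `items` must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items))
    return tuple(signature)


def candidate_pairs(sets_by_key, bands=10, rows_per_band=2):
    """Bucket keys by each band of their signature; only keys that share a
    bucket are compared later, instead of comparing everything to everything."""
    buckets = defaultdict(set)
    for key, items in sets_by_key.items():
        signature = minhash_signature(items, bands * rows_per_band)
        for band in range(bands):
            chunk = signature[band * rows_per_band:(band + 1) * rows_per_band]
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical like sets: user_a and user_b liked the same profiles.
likes = {
    "user_a": {"p1", "p2", "p3", "p4"},
    "user_b": {"p1", "p2", "p3", "p4"},
    "user_c": {"p7", "p8", "p9"},
}
print(candidate_pairs(likes))   # {('user_a', 'user_b')}
```

Each bucket can be processed independently, which is what makes the "compare bucket contents in parallel" step possible.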
MLlib: SparseVector vs. DenseVector
① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
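A plain-Python sketch of why sparse vectors help: the dot product only touches stored entries. The helper names are invented for this example and do not mirror the MLlib SparseVector API. Note that `sparse_dot` as written is only valid when the dropped default value is 0; a non-zero default would need an extra correction term.

```python
def dense_dot(a, b):
    """Touches all n positions, even when almost all of them are zero."""
    return sum(x * y for x, y in zip(a, b))


def to_sparse(dense, default=0):
    """Keep only entries that differ from the most frequent (default) value."""
    return {i: x for i, x in enumerate(dense) if x != default}


def sparse_dot(a, b):
    """Touches only the stored entries of the smaller vector (default must be 0)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(x * b[i] for i, x in a.items() if i in b)


n = 1000
a = [0] * n
b = [0] * n
a[3], a[500], a[999] = 1, 2, 3
b[3], b[42], b[999] = 4, 5, 6

sa, sb = to_sparse(a), to_sparse(b)
print(len(sa), len(sb))                      # 3 stored entries each, instead of 1000
print(dense_dot(a, b), sparse_dot(sa, sb))   # 22 22
```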
Personalized Recommendations
Personalized Recommendation Terms
①User
   User seeking likeable recommendations
②Item
   User who has been liked
   *Also a user seeking likeable recommendations!
③Types of Feedback
   Explicit: rating, like
   Implicit: search, click, hover, view, scroll
Collaborative Filtering Personalized Recs
③Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
Text-based Personalized Recs
④Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
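A plain-Python sketch of the text-similarity idea: extract skip-bigrams (the n=2 case of k-skip n-grams), weight them with a smoothed TF-IDF, and compare profiles by cosine similarity. This is not the MLlib pipeline and uses no Word2Vec; the function names, the smoothing formula, and the three profile texts are assumptions for this example.

```python
import math
from collections import Counter


def skip_bigrams(tokens, k=1):
    """Ordered token pairs with at most k tokens skipped between them."""
    grams = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            grams.append((left, tokens[j]))
    return grams


def tfidf_vectors(docs):
    """docs: list of non-empty feature lists. Returns one {feature: weight} per doc."""
    n = len(docs)
    doc_freq = Counter(f for doc in docs for f in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF so that features seen in every doc keep a small weight.
            f: (c / len(doc)) * (math.log((1 + n) / (1 + doc_freq[f])) + 1)
            for f, c in tf.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


profiles = [
    "loves hiking and craft beer",
    "loves hiking and red wine",
    "enjoys opera and ballet",
]
vectors = tfidf_vectors([skip_bigrams(p.split(), k=1) for p in profiles])
print(cosine(vectors[0], vectors[1]))   # positive: shared skip-bigrams
print(cosine(vectors[0], vectors[2]))   # 0.0: no skip-bigram in common
```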
More Text-based Personalized Recs
⑤Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
More Text-based Personalized Recs
⑥Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
[Image labels: Her Email, My Profile]
Personalized Recommendations: The Future
Facial Recognition
⑦Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Conversation Starter Bot
⑧NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis
[Image labels: Positive response, Negative response]
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Maintaining the Spark
Compromise Recommendations (Couples)
⑨Pathway of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
[Diagram: similar plots -> … … <- similar actors]
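The "something in between" idea can be sketched as a shortest path over an item-item similarity graph. The breadth-first search below is plain Python, not GraphX, and the graph is hypothetical: "Movie A" and "Movie B" are placeholders standing in for whatever titles a real similarity computation would link.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted similarity graph."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None   # the two items are not connected


# Edges connect items judged similar (for example by plot or by actors).
similar = {
    "Mad Max": ["Movie A"],
    "Movie A": ["Mad Max", "Movie B"],
    "Movie B": ["Movie A", "Message In a Bottle"],
    "Message In a Bottle": ["Movie B"],
}
path = shortest_path(similar, "Mad Max", "Message In a Bottle")
print(path[1:-1])   # ['Movie A', 'Movie B'] are the compromise candidates
```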
⑩ The Final Recommendation
⑩ Get Off The Computer and Meet People!
linkedin.com/in/cfregly
github.com/cfregly
chris@fregly.com
@cfregly
Thank you!
Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try Databricks!!