
IMCSummit 2015 - Day 1 Developer Track - Spark After Dark: Generating High Quality Dating Recommendations Using Advanced Real Time Analytics


Spark After Dark is a mock dating site that uses the latest Spark libraries, including Spark SQL, BlinkDB, Spark Streaming, MLlib, and GraphX, to generate high-quality dating recommendations for its members and fast analytics for its operators. We begin with a brief overview of Spark, Spark libraries, and Spark use cases. We then discuss the modern-day Lambda Architecture, which combines real-time and batch processing into a single system. Lastly, we present best practices for monitoring and tuning a highly available Spark and Spark Streaming cluster. There will be many live demos covering everything from basic topics such as ETL and data ingestion to advanced topics such as streaming, sampling, approximations, machine learning, textual analysis, and graph processing.

Published in: Technology

  1. Spark After Dark: Generating High-Quality Recommendations using Real-time Advanced Analytics and Machine Learning with Spark. Chris Fregly, Data Solutions Engineer @ Databricks
  2. Who am I? Data Platform Engineer: playboy.com. Streaming Platform Engineer, NetflixOSS Committer: netflix.com, github.com/Netflix. Data Solutions Engineer, Apache Spark Contributor: databricks.com, github.com/apache/spark
  3. Why After Dark? Playboy After Dark: late 1960s TV show, progressive for its time. And it rhymes!!
  4. What is Spark? Spark Core; Spark Streaming (real-time); Spark SQL (structured data); MLlib (machine learning); GraphX (graph analytics); BlinkDB (approx queries); …
  5. Spark in Production
  6. What is Databricks? Founded by the creators of Spark. Spark as a Service. Powerful Visualizations. Collaborative Notebooks. Scala/Java, Python, SQL, R. Flexible Cluster Management. Job Scheduling and Monitoring
  7. Databricks in Production
  8. Goals of After Dark? ① Generate high-quality recommendations ② Demonstrate Spark high-level libraries: Spark Streaming -> Kafka, Approximations; Spark SQL -> DataFrames, Cassandra; GraphX -> PageRank, Shortest Path; MLlib -> Matrix Factorization, Word2Vec. Images courtesy of tinder.com. Not affiliated with Tinder in any way.
  9. Popular Dating Sites
  10. Themes of this Talk ① Performance ② Parallelism ③ Columnar Storage ④ Approximations ⑤ Similarity ⑥ Minimize Shuffle
  11. Performance
  12. Daytona Gray Sort Contest: on-disk only; 250,000 partitions; no in-memory caching
  13. Improved Shuffle and Network Layer ① Introduced sort-based shuffle: mapper maintains large buffer grouped by keys; reducer seeks directly to group and scans ② Minimizes OS resources: fewer mapper-reducer open files and connections ③ Netty: async keeps CPU hot, reuses ByteBuffer ④ epoll: disk-network comm in kernel space only
  14. Project Tungsten: CPU and Memory ① Largest change to Spark exec engine to date ② Cache-aware data structs and sorting ③ Expand JVM bytecode gen, JIT optimizations ④ Custom mem management, serializers, HashMap
  15. DataFrames and Catalyst. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ Tip: Use DataFrames! --> JVM bytecode generation
  16. Parallelism
  17. Brady Bunch circa 1980. Season 5, Episode 18: “Two Petes in a Pod”
  18. Parallel Algorithm: O(log n)
  19. Non-parallel Algorithm: O(n)
  20. Columnar Storage
  21. Columnar Storage Format. *Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
  22. Parquet File Format ① Based on Google Dremel paper ② Implemented by Twitter and Cloudera ③ Columnar storage format ④ Optimized for fast columnar aggregations ⑤ Tight compression ⑥ Supports pushdowns ⑦ Nested, self-describing, evolving schema
  23. Types of Compression ① Run Length Encoding: repeated data ② Dictionary Encoding: fixed set of values ③ Delta, Prefix Encoding: sorted dataset
  24. Types of Pushdowns ① Column, Partition Pruning ② Row, Predicate Filtering
  25. Approximations
  26. Sketch Algorithm: Count Min Sketch ① Approximate counters ② Better than HashMap ③ Fixed, low memory ④ Known error bounds ⑤ Large num of counters ⑥ Available in Twitter’s Algebird ⑦ Streaming example in Spark
  27. Probabilistic Data Structure: HyperLogLog ① Fixed memory ② Known error distribution ③ Measures set cardinality ④ Approx count distinct ⑤ Number of unique users ⑥ From Twitter’s Algebird ⑦ Streaming example in Spark ⑧ RDD: countApproxDistinctByKey()
  28. Similarity
  29. Types of Similarity ① Euclidean: linear measure; magnitude bias ② Cosine: angle measure; adjusts for magnitude bias ③ Jaccard: set intersection divided by union; popularity bias ④ Log Likelihood: adjusts for bias. [Table: example like matrix with rows Kimberly, Paula, Lisa, Cindy, Holden and columns Ali, Matei, Reynold, Patrick, Andy; a 1 marks a like]
  30. All-pairs Similarity ① Compare everything to everything ② aka “pair-wise similarity” or “similarity join” ③ Naïve shuffle: O(m*n^2); m=rows, n=cols ④ Minimize shuffle: reduce data size and approximate. Reduce m (rows): sampling and bucketing. Reduce n (cols): remove most frequent value (0?)
  31. Minimize Shuffle
  32. Sampling Algo: DIMSUM ① “Dimension Independent Matrix Square Using MR” ② Remove rows with low similarity probability ③ MLlib: RowMatrix.columnSimilarities(…) ④ Twitter: 40% efficiency gain over Cosine
  33. Bucket Algo: Locality Sensitive Hashing ① Split into b buckets using similarity hash algo; requires pre-processing of data ② Compare bucket contents in parallel ③ Converts O(m*n^2) -> O(m*(n/b)*b^2); m=rows, n=cols, b=buckets ④ Example: 500k x 500k matrix, O(1.25E17) -> O(1.25E13); b=50 ⑤ github.com/mrsqueeze/spark-hash
  34. MLlib: SparseVector vs. DenseVector ① Remove columns using sparse vectors ② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n. Tip: Choose most frequent value … may not be 0
  35. Interactive Demo!
  36. Audience Participation Needed! ① Navigate to sparkafterdark.com ② Click 3 actors and 3 actresses
  37. Recommendation Terminology ① User: user seeking likeable recommendations ② Item: user who has been liked (*also a user seeking likeable recommendations!) ③ Types of Feedback: Explicit (Ratings, Like/Dislike); Implicit (Search, Click, Hover, View, Scroll)
  38. Types of Recommendations ① Non-personalized: cold start; no preference or behavior data for user, yet ② Personalized: items that others with similar prefs have liked (User-Item Similarity); items similar to your previously-liked items (Item-Item Similarity)
  39. Non-personalized Recommendations
  40. Summary Statistics and Aggregations ① Top Users by Like Count: “I might like users with the highest sum aggregation of likes overall.” Spark SQL + DataFrame: Aggregations
  41. Like Graph Analysis ② Top Influencers by Like Graph: “I might like users who have the highest probability of me liking them randomly while walking the like graph.” GraphX: PageRank
  42. Demo! Spark SQL + DataFrames + GraphX
  43. Personalized Recommendations
  44. Collaborative Filtering Personalized Recs ③ Like behavior of similar users: “I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
  45. Text-based Personalized Recs ④ Similar profiles to each other: “Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  46. More Text-based Personalized Recs ⑤ Similar profiles from my past likes: “Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  47. More Text-based Personalized Recs ⑥ Relevant, High-Value Emails: “Your initial email has similar named entities to my profile. I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition
  48. Demo! MLlib + ALS + Word2Vec + TF/IDF
  49. Bonus! The Future of Recommendations
  50. Facial Recognition ⑦ Eigenfaces: “Your face looks similar to others that I’ve liked. I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  51. Conversation Starter Bot ⑧ NLP and DecisionTrees: “If your responses to my trite opening lines are positive, I might actually read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  52. Double Bonus! Maintaining the Spark
  53. Compromise Recommendations (Couples) ⑨ Similarity Pathways: “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path. [Figure: pathway between the two movies via similar plots and similar actors]
  54. And the Final, ⑩ Personalized Recommendation
  55. My Personalized Recommendation ⑩ Get Off Your Computer and Be Social!! Thank you! cfregly@databricks.com, @cfregly. Image courtesy of http://www.duchess-france.org/
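
Slide 23 names three encodings and the kind of data each suits. The sketch below illustrates them in plain Python on small lists. The function names and sample values are made up for illustration, and this is not how Parquet lays out bytes on disk; it only shows why repeated, low-cardinality, and sorted columns compress well.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code from a lookup dictionary."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def delta_encode(sorted_values):
    """Store the first value, then only the difference to the previous value."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


rle = run_length_encode(["F", "F", "F", "M", "M"])      # repeated data
dictionary, codes = dictionary_encode(["US", "CA", "US"])  # fixed set of values
deltas = delta_encode([100, 101, 103, 110])             # sorted dataset
```

Delta encoding pays off on sorted data because the differences are small numbers that need fewer bits than the original values.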
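
The Count Min Sketch properties on slide 26 (approximate counters, fixed low memory, known error bounds) can be seen in a minimal pure-Python sketch. This is an illustration under assumptions, not Algebird's implementation: the class, its `width` and `depth` parameters, and the salted-MD5 hashing are all choices made here for brevity.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in fixed memory: `depth` rows of `width` counters."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, simulated by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch(width=1000, depth=5)
for liked_user in ["holden", "holden", "cindy", "holden", "lisa"]:
    sketch.add(liked_user)

holden_likes = sketch.estimate("holden")  # at least 3, the true count
```

Memory stays at `width * depth` counters however many distinct keys arrive, which is the trade against a HashMap that grows with every new key.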
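
The first three similarity measures on slide 29 are short enough to write out. The following is a plain-Python sketch with made-up inputs (the names are borrowed from the slide's example table, the vectors are hypothetical). It shows the magnitude bias the slide mentions: two vectors pointing the same way have a nonzero Euclidean distance but a cosine similarity of 1.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union (non-empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Same direction, different magnitude (e.g. one user simply likes twice as often).
light_user = [1, 1, 0]
heavy_user = [2, 2, 0]
distance = euclidean_distance(light_user, heavy_user)
angle_similarity = cosine_similarity(light_user, heavy_user)

# Sets of liked users: one like in common out of three distinct likes.
overlap = jaccard_similarity({"ali", "matei"}, {"matei", "reynold"})
```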
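
The cost figures on slide 33 can be checked with a few lines of arithmetic. The sketch below reads the bucketed cost as m * (n/b) * b^2, which is the grouping that reproduces the slide's 1.25E13; the variable names are chosen here for illustration.

```python
# Slide 33 example: 500k x 500k matrix split into b = 50 buckets.
m = 500_000  # rows
n = 500_000  # cols
b = 50       # buckets

naive_cost = m * n ** 2                # O(m*n^2), all-pairs comparison
bucketed_cost = m * (n // b) * b ** 2  # O(m*(n/b)*b^2), simplifies to m*n*b

print(f"naive:    {naive_cost:.3e}")                  # naive:    1.250e+17
print(f"bucketed: {bucketed_cost:.3e}")               # bucketed: 1.250e+13
print(f"speedup:  {naive_cost // bucketed_cost}x")    # speedup:  10000x
```

Because the bucketed expression simplifies to m*n*b, the ratio to the naive cost is n/b, here 500,000 / 50 = 10,000.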
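
Slide 41 describes PageRank as the probability of reaching a user while randomly walking the like graph. A small power-iteration sketch makes that concrete. It is plain Python rather than GraphX, the edge list is invented for illustration, and the damping factor of 0.85 and the even redistribution from users with no outgoing likes are conventional assumptions, not details from the talk.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration over a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    outgoing = {node: [] for node in nodes}
    for source, target in edges:
        outgoing[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A user who likes nobody spreads their rank evenly over everyone.
            targets = outgoing[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical like graph: (who liked, who was liked).
likes = [
    ("kimberly", "ali"), ("paula", "ali"), ("cindy", "ali"),
    ("lisa", "matei"), ("holden", "matei"), ("holden", "ali"),
]
ranks = pagerank(likes)
top_influencer = max(ranks, key=ranks.get)
```

In this toy graph the most-liked user ends up with the highest rank, which matches the "Top Influencers by Like Graph" recommendation on the slide.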
