Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Generating high quality dating recommendations using advanced analytics, streaming data pipelines, machine learning, graph analytics, and text processing.

Use the latest Spark libraries, including Spark SQL, DataFrames, BlinkDB, Spark Streaming, MLlib, and GraphX, as well as Twitter's Algebird for sketch algorithms, probabilistic data structures, and approximations.

Published in: Software

  1. After Dark: Generating High-Quality Recommendations using Real-time Advanced Analytics and Machine Learning with Spark. Chris Fregly, chris@fregly.com
  2. Who am I? Streaming Platform Engineer; Streaming Data Engineer; Netflix Open Source Committer; Data Solutions Engineer; Apache Spark Contributor; Spark Author; Consultant, Trainer. advancedspark.com
  3. Why After Dark? Playboy After Dark: late 1960s TV show; a progressive show for its time. And it rhymes!!
  4. What is Spark? Spark Core; Spark Streaming (real-time); Spark SQL (structured data); MLlib (machine learning); GraphX (graph analytics); BlinkDB (approx queries); …
  5. Spark in Production
  6. What is Databricks? Founded by the creators of Spark; Spark as a Service; Amazon AWS based; powerful visualizations; collaborative notebooks; Scala/Java, Python, SQL, R; flexible cluster management; job scheduling and monitoring.
  7. Goals of After Dark? ① Generate high-quality recommendations. ② Demonstrate Spark high-level libraries: Spark Streaming -> Kafka, Approximations; Spark SQL -> DataFrames, Cassandra; GraphX -> PageRank, Shortest Path; MLlib -> Matrix Factorization, Word2Vec. Images courtesy of tinder.com. Not affiliated with Tinder in any way!
  8. Popular Dating Sites
  9. Focus of This Talk: Spark and… ① Parallelism ② Performance ③ Real-time Streaming ④ Approximations ⑤ Similarity Measures
  10. Parallelism
  11. Brady Bunch circa 1974. Season 5, Episode 18: “Two Petes in a Pod”
  12. Parallel Algorithm: O(log n)
  13. Non-parallel Algorithm: O(n)
  14. Spark is Parallel
  15. Performance
  16. Daytona Gray Sort Contest: on-disk only; 250,000 partitions; no in-memory caching. Results table column labels: (2013), (2014), (2014).
  17. Improved Shuffle and Network Layer: ① “Sort-based shuffle” ② Minimize OS resources ③ Switched to async Netty ④ Keep CPUs hot ⑤ Reuse byte buffers to minimize GC ⑥ Use epoll for I/O to stay in kernel space
  18. Project Tungsten: CPU and Memory: ① More JVM bytecode generation, JIT optimization ② CPU-cache-aware data structures and algorithms ③ Custom memory management. Figure labels: Serializers, HashMap.
  19. DataFrames and Catalyst: Please use DataFrames!! -> JVM bytecode generation. Source: https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
  20. Columnar Storage Format: skip whole chunks with min-max heuristics stored in each chunk (sorted data only).
  21. Parquet File Format: ① Based on Google Dremel paper ② Implemented by Twitter and Cloudera ③ Columnar storage format ④ Optimized for fast columnar aggregations ⑤ Tight compression ⑥ Supports pushdowns ⑦ Nested, self-describing, evolving schema
  22. Types of Compression: ① Run Length Encoding: repeated data ② Dictionary Encoding: fixed set of values ③ Delta, Prefix Encoding: sorted dataset
  23. Types of Pushdowns: ① Column, Partition Pruning ② Row, Predicate Filtering
  24. Real-time Streaming
  25. Direct Kafka Streaming (KafkaRDD): ① No single Receiver, no Write Ahead Log (WAL) ② Workers pull from Kafka in parallel ③ Each KafkaRDD partition stores relevant offsets ④ Upon Worker Node failure, rebuild from offsets ⑤ Optimizes happy path by avoiding the WAL. At-least-once delivery guarantee.
  26. Approximations
  27. Count Min Sketch: ① Approximate counters ② Better than HashMap ③ Low, fixed memory ④ Known error bounds ⑤ Large number of counters ⑥ Available in Twitter’s Algebird ⑦ Streaming example in Spark codebase
  28. HyperLogLog: ① Measures set cardinality (approx count distinct) ② Low memory: 1.5KB @ 2% error for 10^9 elements! ③ From Twitter’s Algebird ④ Streaming example in Spark codebase ⑤ RDD: countApproxDistinctByKey()
  29. 10 Recommendations
  30. Types of Recommendations: ① Non-personalized (2 out of 10): cold start; no preference or behavior data for the user yet. ② Personalized (8 out of 10): User-Item Similarity (items that others with similar prefs have liked); Item-Item Similarity.
  31. Interactive Demo!
  32. Audience Participation Needed! ① Navigate to sparkafterdark.com ② Click 3 actors and 3 actresses
  33. Non-personalized Recommendations
  34. Summary Statistics and Aggregations: ① Top Users by Like Count. “I might like users with the highest sum aggregation of likes overall.” SparkSQL + DataFrame: Aggregations
  35. Like Graph Analysis: ② Top Influencers by Like Graph. “I might like users who have the highest probability of me liking them randomly while walking the like graph.” GraphX: PageRank
  36. Demo! Spark SQL + DataFrames + GraphX
  37. Similarity Measures
  38. Types of Similarity: ① Euclidean: linear measure; magnitude bias ② Cosine: angle measure; adjusts for magnitude bias ③ Jaccard: set intersection divided by union; popularity bias ④ Log Likelihood: adjusts for popularity bias. Example like matrix (rows: Kimberly, Leslie, Meredith, Lisa, Holden; columns: Ali, Matei, Reynold, Patrick, Andy).
  39. All-pairs Similarity Measure: ① Compare everything to everything ② a.k.a. “pair-wise similarity” or “similarity join” ③ Naïve shuffle: O(m*n^2); m=rows, n=cols ④ Minimize shuffle: reduce data size and approximate. Reduce m (rows): sampling and bucketing. Reduce n (cols): remove most frequent value (0?).
  40. Sampling Algo: DIMSUM: ① “Dimension Independent Matrix Square using MapReduce” ② Remove rows with low similarity probability ③ MLlib: RowMatrix.columnSimilarities(…) ④ Twitter: 40% efficiency gain over Cosine
  41. Bucket Algo: Locality Sensitive Hashing: ① Split into b buckets using similarity hash algo (requires pre-processing of data) ② Compare bucket contents in parallel ③ Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets ④ Example: 500k x 500k matrix, O(1.25E17) -> O(1.25E13); b=50 ⑤ github.com/mrsqueeze/spark-hash
  42. MLlib: SparseVector vs. DenseVector: ① Remove columns using sparse vectors ② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n. Tip: choose the most frequent value … may not be 0.
  43. Personalized Recommendations
  44. Personalized Recommendation Terms: ① User: user seeking likeable recommendations ② Item: user who has been liked (also a user seeking likeable recommendations!) ③ Types of Feedback: explicit (rating, like); implicit (search, click, hover, view, scroll)
  45. Collaborative Filtering Personalized Recs: ③ Like behavior of similar users. “I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
  46. Text-based Personalized Recs: ④ Similar profiles to each other. “Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  47. More Text-based Personalized Recs: ⑤ Similar profiles from my past likes. “Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  48. More Text-based Personalized Recs: ⑥ Relevant, High-Value Emails. “Your initial email has similar named entities to my profile. I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition. Figure labels: Her Email, My Profile.
  49. Personalized Recommendations: The Future
  50. Facial Recognition: ⑦ Eigenfaces. “Your face looks similar to others that I’ve liked. I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  51. Conversation Starter Bot: ⑧ NLP and DecisionTrees. “If your responses to my trite opening lines are positive, I might actually read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis. Figure labels: Positive response, Negative response. Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  52. Maintaining the
  53. Compromise Recommendations (Couples): ⑨ Pathway of Similarity. “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path. Figure labels: similar plots, similar actors.
  54. ⑩ The Final Recommendation
  55. ⑩ Get Off The Computer and Meet People! Thank you! linkedin.com/in/cfregly, github.com/cfregly, chris@fregly.com, @cfregly. Free trial at databricks.com (image courtesy of http://www.duchess-france.org/)
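
Slide 27 describes the Count Min Sketch as a set of approximate counters with low, fixed memory and known error bounds. A minimal sketch of that idea in plain Python follows. It is not Algebird's API: the class name, the `width` and `depth` parameters, and the SHA-256 based row hashing are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical class, not Algebird's API)."""

    def __init__(self, width=1000, depth=5):
        # Memory is fixed up front: depth rows of width counters each.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across rows
        # is the tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
for liked_user in ["alice", "bob", "alice", "alice"]:
    sketch.add(liked_user)
print(sketch.estimate("alice"), sketch.estimate("bob"))
```

Widening the table lowers the chance of collisions, and adding rows lowers the chance that every row collides for the same item; both trade memory for accuracy.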

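
Slide 38 contrasts Euclidean, cosine, and Jaccard similarity. The following plain-Python sketch shows how the three behave on small inputs. The function names, vectors, and like sets are hypothetical examples, not data from the talk, and the code does not use MLlib.

```python
import math


def euclidean(a, b):
    # Straight-line distance; grows with vector magnitude (magnitude bias).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    # Angle-based similarity; scaling either vector leaves it unchanged.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    # Set intersection divided by set union.
    union = a | b
    if not union:
        return 0.0  # Convention chosen here for two empty sets.
    return len(a & b) / len(union)


# Hypothetical like counts: same direction, different magnitude.
light_user = [1, 1, 0, 0]
heavy_user = [2, 2, 0, 0]
print(euclidean(light_user, heavy_user))  # nonzero despite identical taste
print(cosine(light_user, heavy_user))     # close to 1: magnitude is ignored

# Hypothetical like sets for two users.
likes_a = {"Ali", "Matei", "Reynold"}
likes_b = {"Matei", "Reynold", "Patrick"}
print(jaccard(likes_a, likes_b))          # 2 shared out of 4 total
```

Log likelihood, the fourth measure on the slide, is omitted here because it needs corpus-wide counts rather than a single pair of vectors.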
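
The figures on slide 41 for Locality Sensitive Hashing can be checked with a few lines of arithmetic. This sketch assumes the slide's expression O(m*n/b*b^2) is read left to right, which simplifies to m*n*b; the function names are hypothetical.

```python
def naive_cost(m, n):
    # All-pairs comparison from slide 39: O(m * n^2).
    return m * n ** 2


def bucketed_cost(m, n, b):
    # Slide 41's expression read left to right: (m * n / b) * b^2,
    # which equals m * n * b when b divides m * n evenly.
    return (m * n // b) * b ** 2


m = n = 500_000  # the 500k x 500k matrix from the slide
b = 50           # number of buckets from the slide

naive = naive_cost(m, n)
bucketed = bucketed_cost(m, n, b)
print(f"{naive:.3e} -> {bucketed:.3e}, reduction factor {naive // bucketed}")
```

Under this reading the reduction factor is n/b, so the example's 10,000x saving comes directly from 500,000 columns divided by 50 buckets.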