
IMCSummit 2015 - Day 1 Developer Track - Spark After Dark: Generating High Quality Dating Recommendations Using Advanced Real Time Analytics

Spark After Dark is a mock dating site that uses the latest Spark libraries, including Spark SQL, BlinkDB, Spark Streaming, MLlib, and GraphX, to generate high-quality dating recommendations for its members and blazing-fast analytics for its operators. We begin with a brief overview of Spark, its libraries, and its use cases. In addition, we'll discuss the modern-day Lambda Architecture, which combines real-time and batch processing into a single system. Lastly, we present best practices for monitoring and tuning a highly available Spark and Spark Streaming cluster. There will be many live demos covering everything from basic topics such as ETL and data ingestion to advanced topics such as streaming, sampling, approximations, machine learning, textual analysis, and graph processing.


  1. 1. Spark After Dark: Generating High-Quality Recommendations Using Real-time Advanced Analytics and Machine Learning. Chris Fregly, Data Solutions Engineer @ Databricks
  2. 2. Who am I? Data Platform Engineer (playboy.com). Streaming Platform Engineer, NetflixOSS Committer (netflix.com, github.com/Netflix). Data Solutions Engineer, Apache Spark Contributor (databricks.com, github.com/apache/spark).
  3. 3. Why After Dark? “Playboy After Dark”: late 1960s TV show, progressive for its time. And it rhymes!!
  4. 4. What is Spark? Spark Core; Spark Streaming (real-time); Spark SQL (structured data); MLlib (machine learning); GraphX (graph analytics); BlinkDB (approx queries); …
  5. 5. in Production 5
  6. 6. What is Databricks? Founded by the creators of Spark. Spark as a Service; Powerful Visualizations; Collaborative Notebooks; Scala/Java, Python, SQL, R; Flexible Cluster Management; Job Scheduling and Monitoring
  7. 7. in Production 7
  8. 8. Goals of After Dark? ① Generate high-quality recommendations ② Demonstrate Spark high-level libraries: Spark Streaming -> Kafka, Approximates; Spark SQL -> DataFrames, Cassandra; GraphX -> PageRank, Shortest Path; MLlib -> Matrix Factor, Word2Vec. Images courtesy of tinder.com. Not affiliated with Tinder in any way.
  9. 9. Popular Dating Sites 9
  10. 10. Themes of this Talk 10 ① Performance ② Parallelism ③ Columnar Storage ④ Approximations ⑤ Similarity ⑥ Minimize Shuffle
  11. 11. Performance 11
  12. 12. Daytona Gray Sort Contest: On-disk only; 250,000 partitions; No in-memory caching (figure column labels: 2013, 2014, 2014)
  13. 13. Improved Shuffle and Network Layer ① Introduced sort-based shuffle: mapper maintains a large buffer grouped by keys; reducer seeks directly to its group and scans ② Minimizes OS resources: fewer open files and connections between mappers and reducers ③ Netty: async keeps CPU hot, reuses ByteBuffer ④ epoll: disk-network comm in kernel space only
  14. 14. Project Tungsten: CPU and Memory ① Largest change to Spark exec engine to date ② Cache-aware data structs and sorting ③ Expand JVM bytecode gen, JIT optimizations ④ Custom mem management, serializers, HashMap
  15. 15. DataFrames and Catalyst. Tip: Use DataFrames! --> JVM bytecode generation. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
  16. 16. Parallelism 16
  17. 17. Brady Bunch circa 1980 17 Season 5, Episode 18: “Two Petes in a Pod”
  18. 18. Parallel Algorithm : O(log n) 18 O(log n)
  19. 19. Non-parallel Algorithm : O(n) 19 O(n)
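The contrast between the two complexity slides can be sketched in plain Python. This is only an illustration under stated assumptions: the function names and the sample data are hypothetical, and the "rounds" counter stands in for parallel time by assuming there are enough workers to run every pair in a round at once.

```python
from typing import Callable, List, Tuple


def sequential_reduce(values: List[int], op: Callable[[int, int], int]) -> Tuple[int, int]:
    """Fold left to right; returns (result, number of dependent steps)."""
    result = values[0]
    steps = 0
    for v in values[1:]:
        result = op(result, v)
        steps += 1  # each step must wait for the previous one: O(n)
    return result, steps


def tree_reduce(values: List[int], op: Callable[[int, int], int]) -> Tuple[int, int]:
    """Combine neighbours pairwise; returns (result, number of rounds).

    Pairs inside one round are independent of each other, so with enough
    workers a round costs one time step and the depth is O(log n).
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over to the next round
        level = nxt
        rounds += 1
    return level[0], rounds


likes = list(range(1, 17))  # 16 hypothetical per-partition like counts
seq_total, seq_steps = sequential_reduce(likes, lambda a, b: a + b)
par_total, par_rounds = tree_reduce(likes, lambda a, b: a + b)
print(seq_total, seq_steps, par_total, par_rounds)
```

Both functions return the same total; the difference is that 16 values need 15 dependent steps sequentially but only 4 rounds in the tree. The operation must be associative for the two to agree.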
  20. 20. Columnar Storage 20
  21. 21. Columnar Storage Format 21 *Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
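The min-max chunk skipping mentioned on this slide could look roughly like the sketch below. It is not Parquet's actual reader; `Chunk`, `build_chunks`, and `scan_range` are hypothetical names, and the data is made up. The point is only that a chunk whose stored min/max cannot overlap the query range is never read.

```python
from typing import List, NamedTuple, Tuple


class Chunk(NamedTuple):
    min_value: int
    max_value: int
    values: List[int]


def build_chunks(sorted_values: List[int], chunk_size: int) -> List[Chunk]:
    """Split a sorted column into chunks and record min/max per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        block = sorted_values[i:i + chunk_size]
        chunks.append(Chunk(min(block), max(block), block))
    return chunks


def scan_range(chunks: List[Chunk], low: int, high: int) -> Tuple[List[int], int]:
    """Return (matching values, number of chunks actually read)."""
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        if chunk.max_value < low or chunk.min_value > high:
            continue  # skipped using only the stored statistics
        chunks_read += 1
        matches.extend(v for v in chunk.values if low <= v <= high)
    return matches, chunks_read


ages = sorted([18, 21, 22, 25, 27, 29, 31, 34, 36, 40, 45, 52])
age_chunks = build_chunks(ages, chunk_size=4)
found, read = scan_range(age_chunks, low=30, high=36)
print(found, read)
```

The skipping only pays off when the column is sorted (or at least clustered), which is why the slide restricts it to sorted data: on shuffled data nearly every chunk's min-max range overlaps the query.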
  22. 22. Parquet File Format 22 ① Based on Google Dremel Paper ② Implemented by Twitter and Cloudera ③ Columnar storage format ④ Optimized for fast columnar aggregations ⑤ Tight compression ⑥ Supports pushdowns ⑦ Nested, self-describing, evolving schema
  23. 23. Types of Compression 23 ① Run Length Encoding Repeated data ② Dictionary Encoding Fixed set of values ③ Delta, Prefix Encoding Sorted dataset
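The three encodings on this slide can each be shown in a few lines. These are simplified sketches, not the encodings as Parquet implements them; the function names and sample values are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def run_length_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Collapse runs of repeated values into (value, run_length) pairs."""
    encoded: List[List] = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [(v, n) for v, n in encoded]


def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Replace each value from a small fixed set with an integer code."""
    dictionary: Dict[str, int] = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return dictionary, codes


def delta_encode(sorted_values: List[int]) -> List[int]:
    """Store the first value, then only the difference to the previous one."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


rle = run_length_encode(["F", "F", "F", "M", "M"])
city_dict, city_codes = dictionary_encode(["NYC", "SF", "NYC", "LA"])
deltas = delta_encode([1000, 1003, 1004, 1010])
print(rle, city_dict, city_codes, deltas)
```

Each encoding matches the data shape the slide pairs it with: runs for repeated data, small integer codes for a fixed set of values, and small differences for a sorted dataset.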
  24. 24. Types of Pushdowns 24 ① Column, Partition Pruning ② Row, Predicate Filtering
  25. 25. Approximations 25
  26. 26. Sketch Algorithm: Count Min Sketch 26 ①  Approximate counters ②  Better than HashMap ③  Fixed, low memory ④  Known error bounds ⑤  Large num of counters ⑥  Available in Twitter’s Algebird ⑦  Streaming example in Spark
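A minimal Count-Min Sketch, assuming SHA-256 with a per-row prefix as the hash family, might look like this. It is a teaching sketch rather than Algebird's implementation; the class layout and the `width`/`depth` defaults are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width: int = 1000, depth: int = 5) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


cms = CountMinSketch(width=1000, depth=5)
for name in ["alice", "alice", "alice", "bob"]:
    cms.add(name)
alice_estimate = cms.estimate("alice")
bob_estimate = cms.estimate("bob")
print(alice_estimate, bob_estimate)
```

Memory stays at `width * depth` counters no matter how many distinct items arrive, which is the advantage over a HashMap that the slide points to. The price is that an estimate can be too high, though never too low.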
  27. 27. Probabilistic Data Structure: HyperLogLog 27 ①  Fixed memory ②  Known error distribution ③  Measures set cardinality ④  Approx count distinct ⑤  Number of unique users ⑥  From Twitter’s Algebird ⑦  Streaming example in Spark ⑧  RDD: countApproxDistinctByKey()
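The idea behind HyperLogLog can be sketched as follows. This is a simplified version of the textbook algorithm (raw estimator plus the small-range correction), not Algebird's or Spark's implementation; the class name, the use of SHA-1 truncated to 64 bits, and the choice `p=12` are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using 2**p small registers (p >= 7 here)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        bits = 64 - self.p
        index = h >> bits                      # first p bits pick a register
        rest = h & ((1 << bits) - 1)           # remaining bits
        rank = bits - rest.bit_length() + 1    # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


hll = HyperLogLog(p=12)
for i in range(20_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates leave the registers unchanged
estimate = hll.count()
print(round(estimate))
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, and memory use does not grow with the number of users seen.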
  28. 28. Similarity 28
  29. 29. Types of Similarity ① Euclidean: linear measure; magnitude bias ② Cosine: angle measure; adjusts for magnitude bias ③ Jaccard: set intersection divided by union; popularity bias ④ Log Likelihood: adjusts for bias. Example like matrix (columns: Ali, Matei, Reynold, Patrick, Andy; rows with number of likes: Kimberly 4, Paula 1, Lisa 1, Cindy 2, Holden 5; cell positions not recoverable)
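Three of the measures on this slide are short enough to sketch directly (log likelihood is left out). The vectors and like-sets below are hypothetical and are not taken from the slide's like matrix.

```python
import math
from typing import List, Set


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Straight-line distance; grows with magnitude differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Same taste, different activity level: one user rates three times as heavily.
light_user = [1.0, 1.0, 0.0]
heavy_user = [3.0, 3.0, 0.0]
distance = euclidean_distance(light_user, heavy_user)
cosine = cosine_similarity(light_user, heavy_user)

overlap = jaccard_similarity({"Ali", "Matei", "Reynold"},
                             {"Matei", "Reynold", "Patrick"})
print(distance, cosine, overlap)
```

The two users point in the same direction, so cosine similarity is 1 even though their Euclidean distance is about 2.83. That is the magnitude bias the slide says cosine adjusts for.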
  30. 30. All-pairs Similarity ① Compare everything to everything ② aka “pair-wise similarity” or “similarity join” ③ Naïve shuffle: O(m*n^2); m=rows, n=cols ④ Minimize shuffle: reduce data size & approximate. Reduce m (rows): sampling and bucketing. Reduce n (cols): remove most frequent value (0?)
  31. 31. Minimize Shuffle 31
  32. 32. Sampling Algo: DIMSUM ① “Dimension Independent Matrix Square Using MR” ② Remove rows with low similarity probability ③ MLlib: RowMatrix.columnSimilarities(…) ④ Twitter: 40% efficiency gain over Cosine
  33. 33. Bucket Algo: Locality Sensitive Hashing 33 ①  Split into b buckets using similarity hash algo Requires pre-processing of data ②  Compare bucket contents in parallel ③  Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets ④  Example: 500k x 500k matrix O(1.25E17) -> O(1.25E13); b=50 ⑤  github.com/mrsqueeze/spark-hash
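The arithmetic in the 500k x 500k example can be checked by evaluating the slide's two cost expressions as written. The function names are hypothetical; the formulas are copied from the slide.

```python
def naive_cost(m: int, n: int) -> int:
    """All-pairs cost from the slide: O(m * n^2)."""
    return m * n ** 2


def bucketed_cost(m: int, n: int, b: int) -> int:
    """Bucketed cost from the slide: O(m * n / b * b^2), which equals m * n * b."""
    return m * n // b * b ** 2


m = n = 500_000
b = 50
before = naive_cost(m, n)
after = bucketed_cost(m, n, b)
print(f"{before:.3g} -> {after:.3g} ({before // after}x fewer comparisons)")
```

With these numbers the cost drops from 1.25E17 to 1.25E13, a factor of 10,000, which matches the figures on the slide.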
  34. 34. MLlib: SparseVector vs. DenseVector 34 ①  Remove columns using sparse vectors ②  Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n Tip: Choose most frequent value … may not be 0
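Why sparse vectors shrink the work can be seen with a dictionary-based sketch. This is plain Python rather than MLlib's `SparseVector`; the helper names and the `default` parameter (the "most frequent value", which the tip notes may not be 0) are assumptions.

```python
from typing import Dict, List, Tuple


def to_sparse(dense: List[float], default: float = 0.0) -> Dict[int, float]:
    """Keep only the entries that differ from the most frequent value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def dense_dot(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Dot product over every column; returns (result, multiplications)."""
    return sum(x * y for x, y in zip(a, b)), len(a)


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> Tuple[float, int]:
    """Dot product over shared non-default columns only."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    total = 0.0
    multiplications = 0
    for i, v in small.items():
        if i in large:
            total += v * large[i]
            multiplications += 1
    return total, multiplications


n = 1000  # hypothetical number of columns (users who could be liked)
likes_a = [0.0] * n
likes_b = [0.0] * n
for i in (3, 10, 500):
    likes_a[i] = 1.0
for i in (10, 500, 999):
    likes_b[i] = 1.0

dense_result, dense_ops = dense_dot(likes_a, likes_b)
sparse_result, sparse_ops = sparse_dot(to_sparse(likes_a), to_sparse(likes_b))
print(dense_result, dense_ops, sparse_result, sparse_ops)
```

Both versions return the same dot product, but the sparse one does 2 multiplications instead of 1,000 because it only touches the non-default entries (nnz << n).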
  35. 35. Interactive Demo! 35
  36. 36. Audience Participation Needed! ① Navigate to sparkafterdark.com ② Click 3 actors and 3 actresses
  37. 37. Recommendation Terminology ① User: user seeking likeable recommendations ② Item: user who has been liked (*also a user seeking likeable recommendations!) ③ Types of Feedback: Explicit: Ratings, Like/Dislike; Implicit: Search, Click, Hover, View, Scroll
  38. 38. Types of Recommendations ① Non-personalized (Cold Start): no preference or behavior data for user, yet ② Personalized: User-Item Similarity: items that others with similar prefs have liked; Item-Item Similarity: items similar to your previously-liked items
  39. 39. Non-personalized Recommendations 39
  40. 40. Summary Statistics and Aggregations 40 ① Top Users by Like Count “I might like users with the highest sum aggregation of likes overall.” SparkSQL + DataFrame: Aggregations
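The "top users by like count" aggregation is a group-by and count. The sketch below uses Python's `collections.Counter` as a stand-in for the Spark SQL / DataFrame aggregation named on the slide; the like events are made up.

```python
from collections import Counter

# Hypothetical like events as (from_user, to_user) pairs.
like_events = [
    ("kimberly", "ali"),
    ("holden", "ali"),
    ("cindy", "ali"),
    ("kimberly", "matei"),
    ("holden", "matei"),
    ("paula", "reynold"),
]

# Group by the liked user and count, then take the highest counts.
like_counts = Counter(to_user for _, to_user in like_events)
top_two = like_counts.most_common(2)
print(top_two)
```

This is a non-personalized recommendation: every new user sees the same ranking, which makes it usable in the cold-start case where no preference data exists yet.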
  41. 41. Like Graph Analysis 41 ② Top Influencers by Like Graph “I might like users who have the highest probability of me liking them randomly while walking the like graph.” GraphX: PageRank
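The random-walk intuition quoted on this slide is what PageRank computes. Below is a small power-iteration sketch in plain Python, not GraphX's implementation; the function signature, the damping factor of 0.85, the handling of users with no outgoing likes, and the sample graph are all assumptions.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Probability of landing on each user while randomly walking the like graph."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every user gets a small base share from random jumps.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A user who likes nobody spreads their rank over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: an edge (a, b) means "a likes b".
like_graph = [("ali", "cindy"), ("matei", "cindy"),
              ("cindy", "ali"), ("patrick", "cindy")]
ranks = pagerank(like_graph)
top_influencer = max(ranks, key=ranks.get)
print(top_influencer, ranks)
```

Cindy is liked by three users and comes out on top; Ali ranks second despite having only one like, because that like comes from the highest-ranked user.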
  42. 42. Demo! Spark SQL + DataFrames + GraphX 42
  43. 43. Personalized Recommendations 43
  44. 44. Collaborative Filtering Personalized Recs 44 ③ Like behavior of similar users “I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
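The quoted intuition ("what other people did you like that I haven't seen?") can be sketched with a simple neighbourhood-based collaborative filter. Note that this is a swapped-in technique for illustration: it scores candidates by the Jaccard similarity of like-sets, whereas the slide names MLlib matrix factorization. The function names and like data are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user: str, likes: Dict[str, Set[str]], top_n: int = 3) -> List[str]:
    """Rank people liked by similar users that `user` has not liked yet."""
    mine = likes[user]
    scores: Dict[str, float] = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0.0:
            continue  # no shared likes, so no evidence of shared taste
        for candidate in theirs - mine:
            if candidate != user:
                scores[candidate] = scores.get(candidate, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


likes_by_user = {
    "kimberly": {"ali", "matei", "reynold"},
    "holden": {"ali", "matei", "patrick"},
    "lisa": {"matei", "andy", "patrick"},
    "cindy": {"andy"},
}
recommendations = recommend("kimberly", likes_by_user)
print(recommendations)
```

Patrick ranks first because both of Kimberly's similar users liked him; Andy is suggested only through Lisa, whose overlap with Kimberly is smaller.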
  45. 45. Text-based Personalized Recs 45 ④ Similar profiles to each other “Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  46. 46. More Text-based Personalized Recs 46 ⑤ Similar profiles from my past likes “Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  47. 47. More Text-based Personalized Recs ⑥ Relevant, High-Value Emails “Your initial email has similar named entities to my profile. I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition (figure labels: Her Email, My Profile)
  48. 48. Demo! MLlib + ALS + Word2Vec + TF/IDF 48
  49. 49. Bonus! The Future of Recommendations 49
  50. 50. Facial Recognition 50 ⑦ Eigenfaces “Your face looks similar to others that I’ve liked. I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  51. 51. Conversation Starter Bot ⑧ NLP and DecisionTrees “If your responses to my trite opening lines are positive, I might actually read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis (figure labels: Positive responses, Negative responses) Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  52. 52. Double Bonus! 52 Maintaining the
  53. 53. Compromise Recommendations (Couples) ⑨ Similarity Pathways “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path (figure labels: similar plots, similar actors, …)
  54. 54. And the Final, 54 ⑩ Personalized Recommendation
  55. 55. My Personalized Recommendation 55 ⑩ Get Off Your Computer and Be Social!! Thank you! cfregly@databricks.com @cfregly Image courtesy of http://www.duchess-france.org/
