Dublin Ireland Spark Meetup October 15, 2015

Spark After Dark: Generating High-Quality Recommendations Using Real-time Advanced Analytics and Machine Learning with Spark


  1. 1. Spark After Dark: Generating High-Quality Recommendations using Real-time Advanced Analytics and Machine Learning with Spark. Chris Fregly, chris@fregly.com, IBM Spark Technology Center (spark.tc)
  2. 2. Who am I? Streaming Data Engineer; Netflix Open Source Committer; Data Solutions Engineer; Apache Contributor; Principal Data Solutions Engineer, IBM Spark Technology Center; Meetup Organizer, Advanced Apache Spark Meetup; Book Author, Advanced (2016)
  3. 3. Advanced Apache Spark Meetup. Total Spark Experts: ~1300 in 3 mos! Top 5 most active Spark Meetup globally! Main Goals: Dig deep into the Spark & extended-Spark codebase. Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc. Surface and share the patterns and idioms of these well-designed, distributed, big data components.
  4. 4. Why “Spark After Dark”? “Playboy After Dark”: a late-1960s TV show, progressive for its time. And it rhymes!!
  5. 5. What is Spark? Spark Core; Spark Streaming (real-time); Spark SQL (structured data); MLlib (machine learning); GraphX (graph analytics); … BlinkDB (approx queries)
  6. 6. Deployments in Production 6
  7. 7. Tools of this Talk 7 ① Redis ② Docker ③ Cassandra ④ MLlib, GraphX ⑤ Parquet, JSON ⑥ Apache Zeppelin ⑦ Spark Streaming, Kafka ⑧ Spark SQL, DataFrames ⑨ Spark JDBC/ODBC Hive ThriftServer ⑩ ElasticSearch, Logstash, Kibana (ELK) and…
  8. 8. SMACK Stack! ① Spark (Data Processing) ② Mesos (Cluster Manager) ③ Akka (Actors) ④ Cassandra (NoSQL) ⑤ Kafka (Streaming)
  9. 9. Themes of This Talk 9 ①Parallelism ②Performance ③Streaming ④Approximations ⑤Similarity Measures ⑥Recommendations and…
  10. 10. Goals of Spark After Dark? ① Generate high-quality recommendations ② Demonstrate high-level libraries: Spark Streaming -> Kafka, Approximations; Spark SQL -> DataFrames, Cassandra; GraphX -> PageRank, Shortest Path; MLlib -> Matrix Factorization, Word2Vec. Images courtesy of tinder.com; not affiliated with Tinder in any way.
  11. 11. Popular Dating Sites 11
  12. 12. Parallelism 12
  13. 13. My First Experience with Parallelism: Brady Bunch, circa 1980. Season 5, Episode 18: “Two Petes in a Pod”
  14. 14. Parallel Algorithm : O(log n) 14
  15. 15. Non-parallel Algorithm : O(n) 15
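The O(log n) versus O(n) contrast on the two slides above can be sketched with a pairwise (tree) reduction. This is a single-machine illustration only, not Spark code: `tree_reduce` and `sequential_reduce` are hypothetical helper names, and the "rounds" count stands in for steps that a cluster could run concurrently.

```python
def tree_reduce(values, combine):
    """Combine values pairwise; return (result, number_of_rounds)."""
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # Every pair within a round is independent of the others,
        # so a parallel system could evaluate them all at once.
        merged = [combine(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # odd element carries over unchanged
        level = merged
        rounds += 1
    return level[0], rounds


def sequential_reduce(values, combine):
    """Combine values one at a time; return (result, number_of_steps)."""
    items = iter(values)
    acc = next(items)
    steps = 0
    for value in items:
        acc = combine(acc, value)
        steps += 1
    return acc, steps


if __name__ == "__main__":
    add = lambda a, b: a + b
    print(tree_reduce(range(1, 17), add))        # 16 values, 4 rounds
    print(sequential_reduce(range(1, 17), add))  # 16 values, 15 steps
```

Both functions return the same total; only the depth of dependent steps differs.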
  16. 16. Spark is Parallel
  17. 17. Performance 17
  18. 18. Daytona Gray Sort Contest ① On-disk only ② 28,000 partitions ③ No in-memory caching (results table columns: 2014, 2013, 2014)
  19. 19. Improved Shuffle and Network Layer 19 ①“Sort-based shuffle” ②Minimize OS resources ③Switched to async Netty ④Keep CPUs hot ⑤Reuse byte buffers to minimize GC ⑥Use epoll for I/O to stay in kernel space
  20. 20. Project Tungsten: CPU and Memory ① More JVM bytecode generation, JIT optimization ② CPU-cache-aware data structs and algos ③ Custom memory management (diagram labels: Serializers, Performance, HashMap)
  21. 21. DataFrames and Catalyst Optimizer: Please Use DataFrames! DataFrames -> Catalyst -> JVM bytecode generation. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
  22. 22. Columnar Storage Format 22 *Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
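The min-max heuristic on the slide above can be sketched as follows. This is a toy model under stated assumptions, not Parquet's actual file layout: the chunk dictionaries and the `build_chunks` / `scan_equal` helpers are hypothetical, and the input is assumed to be sorted, as the slide requires.

```python
def build_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks, storing min/max stats per chunk."""
    chunks = []
    for start in range(0, len(sorted_values), chunk_size):
        rows = sorted_values[start:start + chunk_size]
        chunks.append({"min": min(rows), "max": max(rows), "rows": rows})
    return chunks


def scan_equal(chunks, target):
    """Return (matching_values, chunks_actually_read) for column == target."""
    matches = []
    chunks_read = 0
    for chunk in chunks:
        # The stats alone prove the chunk cannot contain the target,
        # so its rows are never touched.
        if target < chunk["min"] or target > chunk["max"]:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk["rows"] if v == target)
    return matches, chunks_read


if __name__ == "__main__":
    chunks = build_chunks(list(range(100)), chunk_size=10)
    print(scan_equal(chunks, 42))
```

On sorted data the value ranges of chunks do not overlap much, which is why most chunks can be skipped; on unsorted data nearly every chunk's range would contain the target.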
  23. 23. Parquet File Format 23 ①Based on Google Dremel Paper ②Implemented by Twitter and Cloudera ③Columnar storage format ④Optimized for fast columnar aggregations ⑤Tight compression ⑥Supports pushdowns ⑦Nested, self-describing, evolving schema
  24. 24. Types of Compression 24 ①Run Length Encoding Repeated data ②Dictionary Encoding Fixed set of values ③Delta, Prefix Encoding Sorted dataset
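The three encodings on the slide above can be illustrated in a few lines each. This is a minimal sketch of the ideas, not Parquet's on-disk encodings; the function names are made up for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run as (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store small integer codes plus a lookup table."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next free code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def delta_encode(sorted_values):
    """Sorted data: store the first value, then successive differences."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


if __name__ == "__main__":
    print(run_length_encode(["a", "a", "a", "b", "b", "a"]))
    print(dictionary_encode(["IE", "US", "IE", "FR"]))
    print(delta_encode([100, 101, 103, 110]))
```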
  25. 25. Types of Query Optimizations ① Column, Partition Pruning ② Row, Predicate Pushdown. Example: SELECT b FROM table WHERE a IN (a2, a3)
  26. 26. Streaming 26
  27. 27. Direct Kafka Streaming - KafkaRDD ① No single Receiver, no Write Ahead Log (WAL) ② Workers pull from Kafka in parallel ③ Each KafkaRDD partition stores relevant offsets ④ Upon Worker Node failure, rebuild from offsets ⑤ Optimizes happy path by avoiding the WAL. At-least-once delivery guarantee.
  28. 28. Approximations 28
  29. 29. Count Min Sketch 29 ①Approximate counters ②Better than HashMap ③Low, fixed memory ④Known error bounds ⑤Large num of counters ⑥From Twitter’s Algebird ⑦Streaming example in codebase
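A Count-Min Sketch along the lines of the slide above can be sketched in plain Python. This is not Algebird's implementation: the `CountMinSketch` class, its `width`/`depth` defaults, and the SHA-256-based hashing are assumptions chosen for a self-contained example.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in fixed memory: depth rows of width cells."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for _ in range(100):
        sketch.add("alice")
    sketch.add("bob", 3)
    print(sketch.estimate("alice"), sketch.estimate("bob"))
```

Memory stays at `width * depth` counters no matter how many distinct items arrive, which is the trade the slide describes against an exact HashMap.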
  30. 30. HyperLogLog ① Approximate cardinality (approx count distinct) ② Low memory: 1.5KB @ 2% error for 10^9 elements! ③ From Twitter’s Algebird ④ Streaming example in codebase ⑤ RDD: countApproxDistinctByKey()
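The idea behind HyperLogLog can be sketched as below. This is a simplified textbook version, not the Algebird or Spark implementation: the `HyperLogLog` class, the choice of `p=10` (1024 registers), and the SHA-256 hash are assumptions, and only the small-range correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct count using 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash
        index = h >> (64 - self.p)              # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(20000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")  # duplicates do not change the registers
    print(round(hll.count()))
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%; more registers lower the error at the cost of memory.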
  31. 31. Monte Carlo Simulations. From the Manhattan Project (A-bomb): simulate movement of neutrons. Law of Large Numbers (LLN): the average of the results of many trials converges on the expected value. SparkPi example in codebase: Pi ~ # red dots / # total dots * 4
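The Pi estimate on the slide above can be reproduced on one machine. This sketch mirrors the idea behind the SparkPi example but is not that code: `estimate_pi`, the quarter-circle formulation, and the fixed seed are choices made here for a reproducible illustration.

```python
import random


def estimate_pi(num_samples, seed=42):
    """Throw random darts at the unit square; count those inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1  # a "red dot" in the slide's terms
    # Area of quarter circle / area of square = pi / 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))
```

Each dart is independent of the others, so the loop splits cleanly across workers, and by the Law of Large Numbers the estimate tightens as the number of samples grows.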
  32. 32. Recommendations 32
  33. 33. Interactive Demo! 33
  34. 34. Audience Participation Needed! ① Navigate to sparkafterdark.com ② Click 3 actors and 3 actresses
  35. 35. Types of Recommendations. Non-personalized: Cold Start (no preference or behavior data for the user yet). Personalized: User-Item Similarity (items that others with similar prefs have liked); Item-Item Similarity (items similar to your previously liked items).
  36. 36. Non-personalized Recommendations 36
  37. 37. Summary Statistics and Aggregations 37 ①Top Users by Like Count “I might like users with the highest sum aggregation of likes overall.” SparkSQL + DataFrame: Aggregations
  38. 38. Like Graph Analysis 38 ②Top Influencers by Like Graph “I might like users who have the highest probability of me liking them randomly while walking the like graph.” GraphX: PageRank
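The "random walk over the like graph" intuition can be sketched with a textbook PageRank iteration. This is not GraphX's implementation (which uses its own normalization and convergence controls): the `pagerank` function, the damping factor of 0.85, and the sample like edges are all assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (liker, liked) pairs. Returns {user: rank}, ranks sum to 1."""
    nodes = sorted({user for edge in edges for user in edge})
    out_links = {user: [] for user in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {user: 1.0 / len(nodes) for user in nodes}
    for _ in range(iterations):
        # Random-jump share that every user receives each round.
        new_rank = {user: (1.0 - damping) / len(nodes) for user in nodes}
        for src in nodes:
            # A user who likes nobody spreads their rank evenly over everyone.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    likes = [("ann", "dan"), ("bob", "dan"), ("cat", "dan"), ("dan", "ann")]
    ranks = pagerank(likes)
    for user in sorted(ranks, key=ranks.get, reverse=True):
        print(user, round(ranks[user], 3))
```

In the sample data "dan" is liked by three users and ranks first; "ann" ranks second because her single like comes from the top-ranked user.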
  39. 39. Demo! Spark SQL + DataFrames + GraphX + Hive ThriftServer 39
  40. 40. Finding Similarities 40
  41. 41. Types of Similarity. Euclidean: linear measure (magnitude bias). Cosine: angle measure (adjusts for magnitude bias). Jaccard: intersection / union (popularity bias). Log Likelihood: adjusts for popularity bias. [Table: like matrix with columns Ali, Matei, Reynold, Patrick, Andy and rows Kimberly (4 likes), Leslie (2), Meredith (3), Lisa (3), Holden (5); individual cell positions were lost in extraction]
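Three of the measures named on the slide above can be written directly from their definitions. This is a minimal sketch with made-up inputs (the slide's own like matrix is not recoverable), and log likelihood is omitted.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Angle between vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of intersection over size of union of two sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    light_rater = [1, 2, 3]
    heavy_rater = [2, 4, 6]  # same taste, twice the magnitude
    print(euclidean_distance(light_rater, heavy_rater))
    print(cosine_similarity(light_rater, heavy_rater))
    print(jaccard_similarity({"ali", "matei"}, {"matei", "reynold"}))
```

The two rating vectors point in the same direction, so cosine treats them as identical while Euclidean distance reports them as far apart; that is the magnitude bias the slide refers to.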
  42. 42. All-Pairs Similarity Comparison. Compare everything to everything, a.k.a. “pair-wise similarity” or “similarity join”. Naïve shuffle: O(m*n^2); m=rows, n=cols. Must minimize shuffle through approximations. Reduce m (rows): sampling and bucketing. Reduce n (cols): remove the most frequent value (e.g., 0).
  43. 43. Reduce m: DIMSUM Sampling. Dimension Independent Matrix Square Using MapReduce. Remove rows with low similarity probability. MLlib: RowMatrix.columnSimilarities(…). Twitter: 40% efficiency gain over Cosine.
  44. 44. Reduce m: LSH Bucketing. Locality Sensitive Hashing. Split m into b buckets using a similarity hash algo. Requires pre-processing of data. Compare bucket contents in parallel. Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets. Example: 500k x 500k matrix, O(1.25E17) -> O(1.25E13) with b=50. github.com/mrsqueeze/spark-hash
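The slide's figures for the 500k x 500k example can be checked with plain arithmetic. This sketch only evaluates the two cost expressions; it assumes the bucketed expression O(m*n/b*b^2) is read left to right (which simplifies to m*n*b), and the function names are hypothetical.

```python
def naive_cost(m, n):
    """All-pairs comparison cost from the slide: m * n^2."""
    return m * n ** 2


def bucketed_cost(m, n, b):
    """Slide's bucketed cost m*n/b*b^2, evaluated left to right (equals m*n*b)."""
    return m * n // b * b ** 2


if __name__ == "__main__":
    m = n = 500_000
    b = 50
    print(f"naive:    {naive_cost(m, n):.3e}")
    print(f"bucketed: {bucketed_cost(m, n, b):.3e}")
    print(f"ratio:    {naive_cost(m, n) // bucketed_cost(m, n, b)}")
```

Under that reading both numbers on the slide check out, and the reduction factor is n/b.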
  45. 45. Reduce n: Remove Most Frequent Value. Eliminate the most frequent value and represent the other values with (index,value) pairs. Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n. Choose the most frequent value, which may not be zero!
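The (index,value) representation can be sketched for a single row. This is an illustration of the idea rather than MLlib's SparseVector: `to_sparse` and `to_dense` are hypothetical helpers, and the row is assumed to be non-empty.

```python
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value; keep (index, value) pairs for the rest."""
    default, _count = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != default]
    return default, len(row), pairs


def to_dense(default, length, pairs):
    """Rebuild the original row from the sparse form."""
    row = [default] * length
    for index, value in pairs:
        row[index] = value
    return row


if __name__ == "__main__":
    row = [1, 1, 0, 1, 1, 1, 5, 1]  # most frequent value is 1, not 0
    default, length, pairs = to_sparse(row)
    print(default, length, pairs)
    print(to_dense(default, length, pairs) == row)
```

Here the dropped value is 1, matching the slide's point that the most frequent value is not necessarily zero.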
  46. 46. Personalized Recommendations 46
  47. 47. Terminology of Recommendations. User: user seeking recommendations. Item: item that has been liked or rated. Feedback: explicit (like, rating) or implicit (search, click, hover, view, scroll).
  48. 48. Collaborative Filtering Personalized Recs 48 ③Like behavior of similar users “I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
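The quoted intuition ("what did similar users like that I haven't seen?") can be sketched with a neighborhood-based recommender. Note the swap: this uses Jaccard similarity between users' like sets, not MLlib's matrix factorization, and the `recommend` function and sample data are invented for illustration.

```python
def recommend(likes, user, top_n=3):
    """likes: {user: set of liked items}. Returns up to top_n unseen items."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        union = mine | theirs
        similarity = len(mine & theirs) / len(union) if union else 0.0
        if similarity == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in theirs - mine:
            # Each unseen item is weighted by how similar its fan is to me.
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


if __name__ == "__main__":
    likes = {
        "me": {"a", "b"},
        "u1": {"a", "b", "c"},
        "u2": {"a", "d"},
        "u3": {"x", "y"},
    }
    print(recommend(likes, "me"))
```

Matrix factorization reaches a similar goal differently: it learns low-dimensional user and item vectors so that sparse like data generalizes beyond direct overlaps.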
  49. 49. Demo! Spark SQL + DataFrames + MLlib 49
  50. 50. Text-based Personalized Recs 50 ④Similar profiles to me “Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity
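Profile-to-profile text similarity can be sketched with TF-IDF weights and cosine similarity. This is a plain-Python illustration, not MLlib's HashingTF/IDF or Word2Vec: whitespace tokenization, the smoothed IDF formula, and the sample profiles are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (one common variant): rarer terms weigh more.
            term: (count / len(tokens))
                  * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    profiles = [
        "i love hiking and spark",
        "spark and hiking all weekend",
        "cooking pasta at home",
    ]
    vecs = tfidf_vectors(profiles)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two profiles share several terms and score above zero; the third shares none with the first and scores exactly zero.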
  51. 51. More Text-based Personalized Recs 51 ⑤Similar profiles from my past likes “Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity
  52. 52. More Text-based Personalized Recs ⑥ Relevant, High-Value Emails “Your initial email has similar named entities to my profile. I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition (diagram labels: Her Email, My Profile)
  53. 53. The Future of Personalized Recommendations 53
  54. 54. Facial Recognition 54 ⑦Eigenfaces “Your face looks similar to others that I’ve liked. I might like you.” MLlib: RowMatrix, PCA, Item-Item Similarity Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  55. 55. Conversation Bot ⑧ NLP and DecisionTrees “If your responses to my trite opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis (Positive / Negative). Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  56. 56. Maintaining the Spark
  57. 57. Couples’ Recommendations ⑨ Pathways of Similarity “I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.” MLlib: RowMatrix, Item-Item Similarity. GraphX: Nearest Neighbors, Shortest Path (diagram labels: similar plots, similar actors)
  58. 58. Final Recommendation
  59. 59. ⑩ Get Off The Computer and Meet People! Thank you!! chris@fregly.com, @cfregly, IBM Spark Technology Center (spark.tc), advancedspark.com, github.com/fluxcapacitor/pipeline, hub.docker.com/r/fluxcapacitor/pipeline/. Image courtesy of http://www.duchess-france.org/