Real-Time, Streaming Advanced Analytics, Approximations, and Recommendations Using Apache Spark ML/GraphX, Kafka, Stanford CoreNLP, and Twitter

Published in: Technology

  1. 1. Flux Capacitor AI: Bringing AI Back to the Future! advancedspark.com
  2. 2. Who Am I? Streaming Data Engineer, Netflix OSS Committer; Data Solutions Engineer, Apache Contributor; Principal Data Solutions Engineer, IBM Technology Center; Meetup Organizer, Advanced Apache Meetup; Book Author, Advanced . Due 2016
  3. 3. Advanced Apache Spark Meetup (http://advancedspark.com). Meetup Metrics: Top 10 most-active Spark Meetup; 3200+ members in just 9 months; 3700+ Docker downloads (demos). Meetup Mission: code deep-dive into Spark and related open source projects; surface key patterns and idioms; focus on distributed systems, scale, and performance
  4. 4. Live, Interactive Demo! Audience participation required. Cell phone compatible. demo.advancedspark.com
  5. 5. http://demo.advancedspark.com (architecture diagram). Top row: End User -> ElasticSearch -> Spark ML -> Data Scientist ->. Bottom row: <- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython
  6. 6. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  7. 7. Scaling with Parallelism (figure labels: Peter; O(log n); Worker Nodes)
  8. 8. Parallelism with Composability (figure labels: Worker 1, Worker 2, Collect at Driver)
      Max: (a max b max c max d) == (a max b) max (c max d)
      Set Union: (a U b U c U d) == (a U b) U (c U d)
      Addition: (a + b + c + d) == (a + b) + (c + d)
      Multiply: (a * b * c * d) == (a * b) * (c * d)
      What about Division and Average?
  9. 9. What about Division? Not composable.
      Division: (a / b / c / d) != (a / b) / (c / d)
      (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
      (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
      0.0134 != 0.857
      What were the Egyptians thinking?! “Divide like an Egyptian”
  10. 10. What about Average?
      Overall AVG over (value, count) pairs (3, 1) + (5, 1) + (5, 1) + (7, 1): (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
      Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
      Divide, Add, Divide? Not composable.
      Add, Add, Add? Composable!
      Divide at the end on a single node? Doesn’t need to be composable! AVG (3, 5, 5, 7) == 5
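The composable-average idea on slides 8 to 10 can be sketched in a few lines of Python. This is a minimal illustration, not the talk's demo code: the names `to_pair`, `combine`, and `average`, and the way values are split into partitions, are assumptions made for the example. Each "worker" only adds (sum, count) pairs; the single division happens once at the end.

```python
from functools import reduce

def to_pair(value):
    # Lift a raw value into a (sum, count) pair.
    return (value, 1)

def combine(a, b):
    # Associative merge: add sums and add counts. Grouping does not matter.
    return (a[0] + b[0], a[1] + b[1])

def average(values, num_partitions=2):
    # Hypothetical partitioning that stands in for worker nodes.
    partitions = [values[i::num_partitions] for i in range(num_partitions)]
    # Each partition reduces its own values independently.
    partials = [reduce(combine, map(to_pair, part), (0, 0)) for part in partitions]
    # The "driver" merges the partial results and divides exactly once.
    total, count = reduce(combine, partials, (0, 0))
    return total / count

print(average([3, 5, 5, 7]))  # 5.0
```

Because `combine` is associative, the result is the same for any number of partitions; the sketch assumes a non-empty input so the final division is defined.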
  11. 11. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  12. 12. Similarities
  13. 13. Euclidean Similarity: exists in Euclidean (flat) space; based on Euclidean distance; linear measure; bias towards magnitude
  14. 14. Cosine Similarity: angular measure; adjusts for Euclidean magnitude bias; normalize to unit vectors in all dimensions; used with real-valued vectors (versus binary); see org.jblas.DoubleMatrix
  15. 15. Jaccard Similarity: set similarity measurement; set intersection / set union; bias towards popularity; works with binary vectors
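To make the three measures on slides 13 to 15 concrete, here is a small pure-Python sketch. The function names and the sample vectors and tag sets are assumptions for illustration; the slides themselves point to org.jblas.DoubleMatrix rather than to this code.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance; grows with vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based measure; dividing by both norms removes magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(s, t):
    # Set intersection size divided by set union size.
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

a, b = [1.0, 2.0], [2.0, 4.0]  # same direction, different magnitude
print(euclidean_distance(a, b))
print(cosine_similarity(a, b))
print(jaccard_similarity({"action", "sci-fi"}, {"action", "drama"}))
```

The two sample vectors point in the same direction, so their cosine similarity is (up to floating-point rounding) 1 even though their Euclidean distance is not 0, which is the magnitude bias the slides mention. The sketch assumes non-zero vectors and non-empty sets.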
  16. 16. Log Likelihood Similarity: adjusts for popularity bias; Netflix “Shawshank” problem
  17. 17. Word Similarity. Edit Distance: misspellings and autocorrect. Word2Vec: similar words are defined by similar contexts in vector space (figure labels: English, Spanish)
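As an illustration of the edit-distance idea on slide 17, the following sketch computes the Levenshtein distance (insertions, deletions, substitutions) with the standard dynamic-programming recurrence. The slide does not say which edit-distance variant the talk had in mind, so Levenshtein is an assumption here.

```python
def edit_distance(a, b):
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

A spell-checker or autocorrect feature could rank dictionary words by this distance to the misspelled input.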
  18. 18. Demo! Find Synonyms with Word2Vec
  19. 19. Find Synonyms using Word2Vec
  20. 20. Document Similarity. TF/IDF: Term Frequency / Inverse Document Frequency; used by most search engines. Doc2Vec: similar documents are determined by similar contexts
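The TF/IDF weighting named on slide 20 can be sketched as follows. This is one common textbook variant (relative term frequency times the unsmoothed log of inverse document frequency); the whitespace tokenizer, the function name `tf_idf`, and the sample documents are assumptions, and real libraries often apply smoothing that this sketch omits.

```python
import math
from collections import Counter

def tf_idf(docs):
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["spark streaming kafka", "spark ml", "kafka streams"]
for doc, vector in zip(docs, tf_idf(docs)):
    print(doc, vector)
```

Terms that appear in every document get weight 0, while rare terms are weighted up; the resulting sparse vectors can then be compared with cosine similarity.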
  21. 21. Bonus! Text Rank Document Summary. Text Rank (aka Sentence Rank) surfaces summary sentences: TF/IDF + Similarity Graph + PageRank. Most similar sentence to all other sentences: TF/IDF + Similarity Graph. Most influential sentences: PageRank
  22. 22. Similarity Pathways (Recommendations): best recommendations for 2 (or more) people. “You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.” Item-to-Item Similarity Graph + Dijkstra Heaviest Path
  23. 23. Demo! Similarity Pathway for Movie Recommendations
  24. 24. Load Movies with Tags into DataFrame (figure labels: My Choice, Their Choice)
  25. 25. Item-to-Item Jaccard Similarity Based on Tags: calculate Jaccard similarity (tag set similarity); must be above the given Jaccard similarity threshold
  26. 26. Item-to-Item Tag Similarity Graph: edge weights == Jaccard similarity (based on tag sets)
  27. 27. Use Dijkstra to Find Heaviest Pathway
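The slides do not define "heaviest" precisely, so the sketch below makes an explicit assumption: the heaviest pathway is the one that maximizes the product of edge similarities. Since every similarity lies in (0, 1], minimizing the sum of -log(similarity) is equivalent, and those costs are non-negative, so ordinary Dijkstra applies. The graph contents (including "Waterworld" and "The Postman" and all edge weights) are hypothetical, and this is not the demo's GraphX code.

```python
import heapq
import math

def heaviest_path(graph, source, target):
    # graph: {node: {neighbor: similarity in (0, 1]}}
    best = {source: 0.0}  # lowest known sum of -log(similarity) per node
    previous = {}
    queue = [(0.0, source)]
    visited = set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, similarity in graph.get(node, {}).items():
            new_cost = cost - math.log(similarity)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return None, 0.0  # no pathway connects the two items
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])

# Hypothetical tag-similarity graph; the weights are made up for illustration.
graph = {
    "Mad Max": {"Waterworld": 0.6, "The Postman": 0.4},
    "Waterworld": {"Message in a Bottle": 0.5},
    "The Postman": {"Message in a Bottle": 0.9},
}
print(heaviest_path(graph, "Mad Max", "Message in a Bottle"))
```

The intermediate nodes on the returned path are the candidate recommendations that sit "between" the two people's choices.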
  28. 28. Calculating Exact Similarity: brute-force similarity; Cartesian product; O(n^2) shuffle and compute; also known as all-pairs, pair-wise, or similarity join
  29. 29. Calculating Approximate Similarity. Goal: reduce shuffle. Approximate similarity: sampling; bucketing or clustering (e.g. bucket by genre); ignore low-similarity probability; Locality Sensitive Hashing; Twitter Algebird MinHash
  30. 30. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  31. 31. Recommendations
  32. 32. Basic Terminology
      User: user seeking recommendations
      Item: item being recommended
      Explicit User Feedback: user knows they are rating or liking, can choose to dislike
      Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc.)
      Instances: rows of user feedback/input data
      Overfitting: training a model too closely to the training data & hyperparameters
      Hold Out Split: holding out some of the instances to avoid overfitting
      Features: columns of instance rows (of feedback/input data)
      Cold Start Problem: not enough data to personalize (new)
      Hyperparameter: model-specific config knobs for tuning (tree depth, iterations)
      Model Evaluation: compare predictions to actual values of hold out split
      Feature Engineering: modify, reduce, combine features
      Loss Function: function we’re trying to minimize, such as least-squared error for Linear Regression
      Cross Entropy: loss function used for classification algorithms such as Logistic Regression
      Optimizer: technique to optimize the loss function, such as Stochastic Gradient Descent (SGD)
  33. 33. Stochastic Gradient Descent (SGD): optimizes the loss function. Least Squared Error: between predicted and actual value. Cross Entropy: log likelihood between predicted and actual probability (figure labels: 2-Dimensional, 3-Dimensional)
  34. 34. Features
      Binary: True or False
      Numeric Discrete: integers
      Numeric: real values
      Binning: convert continuous into discrete (Time of Day -> Morning, Afternoon)
      Categorical Ordinal: Size (Small -> Medium -> Large), Ratings (1 -> 5)
      Categorical Nominal: independent; Favorite Sports Teams, Dating Spots
      Temporal: time-based; Time of Day, Binge Viewing
      Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
      Media: Images, Audio, Video
      Geographic: (Longitude, Latitude), Geohash
      Latent: hidden features within data (Collaborative Filtering)
      Derived: Age of Movie, Duration of User Subscription
  35. 35. Feature Engineering
      Dimension Reduction: reduce the number of features in feature space
      Principal Component Analysis (PCA): find principal features that best describe data variance; peel dimensional layers back
      One-Hot Encoding: convert nominal categorical feature values into 0’s and 1’s; remove any numerical relationship between categories; convert each item to a binary vector with a single 1.0 column
        Bears    -> 1  -->  Bears    -> [1.0, 0.0, 0.0]
        49’ers   -> 2  -->  49’ers   -> [0.0, 1.0, 0.0]
        Steelers -> 3  -->  Steelers -> [0.0, 0.0, 1.0]
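The one-hot table on slide 35 can be reproduced with a short sketch. The function name `one_hot_encode` and the choice to order categories by first appearance are assumptions for illustration; Spark's own encoder indexes categories differently.

```python
def one_hot_encode(values):
    # Keep categories in first-seen order so the output matches the slide's table.
    categories = list(dict.fromkeys(values))
    return {
        category: [1.0 if i == position else 0.0 for i in range(len(categories))]
        for position, category in enumerate(categories)
    }

encoded = one_hot_encode(["Bears", "49'ers", "Steelers", "Bears"])
for team, vector in encoded.items():
    print(team, vector)
# Bears [1.0, 0.0, 0.0]
# 49'ers [0.0, 1.0, 0.0]
# Steelers [0.0, 0.0, 1.0]
```

Every vector has exactly one 1.0 entry, so no team is numerically "larger" than another, which is the point of replacing the 1, 2, 3 codes.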
  36. 36. Feature Normalization & Standardization. Goal: scale features to a standard size; prevent boundless features; helps avoid overfitting; required by many ML algorithms. Normalize features: calculate the L1 (or L2, etc.) norm, then divide each element by it. Standardize features: apply the standard normal transformation (mean -> 0, stddev -> 1). See org.apache.spark.ml.feature.[Normalizer, StandardScaler] and http://www.mathsisfun.com/data/standard-normal-distribution.html
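A minimal sketch of the two transformations on slide 36, written in plain Python rather than with the Spark classes the slide names. The function names are assumptions, and `standardize` uses the population standard deviation; libraries may use the sample estimate instead, so exact values can differ slightly.

```python
import math

def normalize(vector, p=2):
    # Divide each element by the vector's Lp norm (p=1 for L1, p=2 for L2).
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector]

def standardize(column):
    # Shift to mean 0 and scale to standard deviation 1 (population stddev, an assumption).
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    stddev = math.sqrt(variance)
    return [(x - mean) / stddev for x in column]

print(normalize([3.0, 4.0]))        # L2: unit-length vector
print(normalize([1.0, 3.0], p=1))   # L1: absolute values sum to 1
print(standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Normalization works per row (one feature vector at a time), while standardization works per column (one feature across all instances). The sketch assumes a non-zero vector and a column with non-zero variance.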
  37. 37. Non-Personalized Recommendations
  38. 38. Cold Start Problem: new user, don’t know their preferences, must show something! Movies with highest-rated actors: Top K aggregations. Facebook social graph: friend-based recommendations. Most desirable singles: PageRank of likes and dislikes
  39. 39. Demo! GraphFrame PageRank
  40. 40. Example: Dating Site “Like” Graph
  41. 41. PageRank of Top Influencers
  42. 42. Personalized Recommendations
  43. 43. Demo! Personalized PageRank
  44. 44. Personalized PageRank: Outbound Links. 0.15 = (1 - 0.85 “Damping Factor”). 85% probability: choose among outbound network. 15% probability: choose self or random
  45. 45. Personalized PageRank: No Outbound. 0.15 = (1 - 0.85 “Damping Factor”). 85% probability: choose among outbound network, but this node has no outbound network! 15% probability: choose self or random
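The damping-factor mechanics on slides 44 and 45 can be sketched with a small power iteration. This is an illustrative version, not the GraphFrames implementation used in the demo: the function name, the tiny "like" graph, and in particular the treatment of nodes without outbound links (all of their rank returns to the source node) are assumptions, since the slides only raise that case without stating how it is resolved.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    # graph: {node: [nodes it links to]}
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {node: (1.0 if node == source else 0.0) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node in nodes:
            outbound = graph.get(node, [])
            if outbound:
                # 85%: follow one of the outbound links, chosen uniformly.
                share = damping * rank[node] / len(outbound)
                for target in outbound:
                    new_rank[target] += share
                # 15%: jump back to the personalization source.
                new_rank[source] += (1 - damping) * rank[node]
            else:
                # No outbound links: assume all rank returns to the source.
                new_rank[source] += rank[node]
        rank = new_rank
    return rank

likes = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
for person, score in sorted(personalized_pagerank(likes, "alice").items()):
    print(person, round(score, 3))
```

Total rank is conserved at 1.0 on every iteration, so the scores can be read as the long-run share of time a random walker starting from the source spends on each node.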
  46. 46. User-to-User Clustering (User Similarity). Time-based: pattern of viewing (binge or casual); time of viewing (am or pm). Ratings-based: content ratings or number of views; average rating relative to others (critical or lenient). Search-based: search terms
  47. 47. Item-to-Item Clustering (Item Similarity). Profile text (TF/IDF, Word2Vec, NLP): cluster similar profiles. Categories, tags, interests (Jaccard Similarity, LSH): cluster similar categories. Images, facial structures (Neural Nets, Eigenfaces): cluster similar eigenfaces. Dating site example…
  48. 48. Bonus: NLP Conversation Starter Bot. “If your responses to my generic opening lines are positive, I may read your profile.” Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment. http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
  49. 49. Bonus: Demo! Spark + Stanford CoreNLP Sentiment Analysis
  50. 50. Bonus: Top 100 Country Song Sentiment
  51. 51. Bonus: Surprising Results…?!
  52. 52. Item-to-Item Based Recommendations: based on metadata (Genre, Description, Cast, City)
  53. 53. Demo! Item-to-Item-based Recommendations: One-Hot Encoding + K-Means Clustering
  54. 54. One-Hot Encode Tag Feature Vectors
  55. 55. Cluster Movie Tag Feature Vectors: hyperparameter tuning (K clusters?)
  56. 56. Analyze Movie Tag Clusters
  57. 57. User-to-Item Collaborative Filtering (Matrix Factorization): ① Factor the large matrix (left) into 2 smaller matrices (right) ② Lower-rank matrices approximate original when multiplied ③ Fill in the missing values of the large matrix ④ Surface k (rank) latent features from user-item interactions
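The four steps on slide 57 can be sketched in pure Python. Note the swap: the demo slides later in the deck refer to ALS, whereas this sketch fits the factors with plain stochastic gradient descent because it is shorter to write out. The ratings, rank, learning rate, and regularization values are all made up for illustration.

```python
import math
import random

def predict(user_factors, item_factors, user, item):
    # Dot product of the user's and item's latent feature vectors.
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))

def rmse(ratings, user_factors, item_factors):
    squared = [(r - predict(user_factors, item_factors, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

def init_factors(count, rank, rng):
    return [[rng.random() for _ in range(rank)] for _ in range(count)]

def train(ratings, user_factors, item_factors, epochs=200, learning_rate=0.01, regularization=0.02):
    for _ in range(epochs):
        for user, item, rating in ratings:
            error = rating - predict(user_factors, item_factors, user, item)
            for k in range(len(user_factors[user])):
                u, v = user_factors[user][k], item_factors[item][k]
                user_factors[user][k] += learning_rate * (error * v - regularization * u)
                item_factors[item][k] += learning_rate * (error * u - regularization * v)

# Hypothetical sparse (user, item, rating) observations; most cells are missing.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
rng = random.Random(42)
user_factors = init_factors(3, 2, rng)  # 3 users x rank 2
item_factors = init_factors(3, 2, rng)  # 3 items x rank 2

rmse_before = rmse(ratings, user_factors, item_factors)
train(ratings, user_factors, item_factors)
rmse_after = rmse(ratings, user_factors, item_factors)
print(rmse_before, rmse_after)
# Step 3 of the slide: fill in a missing cell (user 0 never rated item 2).
print(predict(user_factors, item_factors, 0, 2))
```

The two small factor matrices are the "latent features"; multiplying a user row by an item row gives a predicted rating for any cell, observed or not.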
  58. 58. Item-to-Item Collaborative Filtering (famous Amazon paper, circa 2003). Problem: as users grew, user-to-item collaborative filtering didn’t scale. Solution: item-to-item similarity, nearest neighbors. Offline (batch): generate itemId->List[userId] vectors. Online (real-time): from cart, recommend nearest neighbors in vector space
  59. 59. Demo! Collaborative Filtering-based Recommendations
  60. 60. Fitting the Matrix Factorization Model
  61. 61. Show ItemFactors Matrix from ALS
  62. 62. Show UserFactors Matrix from ALS
  63. 63. Generating Individual Recommendations
  64. 64. Generating Batch Recommendations
  65. 65. Clustering + Collaborative Filtering Recs: cluster the matrix output from Matrix Factorization (latent features derived from user-item interaction). Item-to-Item Similarity: cluster the item-factor matrix. User-to-User Similarity: cluster the user-factor matrix
  66. 66. Demo! Clustering + Collaborative Filtering-based Recommendations
  67. 67. Show ItemFactors Matrix from ALS
  68. 68. Convert Item Factors -> mllib.Vector: required by the K-Means clustering algorithm
  69. 69. Fit and Evaluate K-Means Cluster Model: K = 5 clusters; the evaluation measures closeness of points within clusters
  70. 70. Netflix Genres and Clusters. Typical genres: Documentary, Romance, Comedy, Horror, Action, Adventure. Latent (hidden) clusters: Emotionally-Independent Dramas for Hopeless Romantics; Witty Dysfunctional-Family TV Animated Comedies; Romantic Crime Movies based on Classic Literature; Latin American Forbidden-Love Movies; Critically-acclaimed Emotional Drug Movie; Cerebral Military Movie based on Real Life; Sentimental Movies about Horses for Ages 11-12; Gory Canadian Revenge Movies; Raunchy Mad Scientist Comedy
  71. 71. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  72. 72. When to Approximate? Memory- or time-constrained queries: relative vs. exact counts are OK (approximate # of errors after a release). Using machine learning or graph algorithms: inherently probabilistic and approximate. Streaming aggregations: inherently sloppy collection (exactly once?). Approximate as much as you can get away with! Ask for forgiveness later!
  73. 73. When NOT to Approximate? If you’ve ever heard the term “Sarbanes-Oxley” at the office.
  74. 74. A Few Good Algorithms: “You can’t handle the approximate!”
  75. 75. Common to These Algos & Data Structs: low, fixed size in memory; store large amount of data; known error bounds; tunable tradeoff between size and error; less memory than Java/Scala collections; rely on multiple hash functions or operations; size of hash range defines error
  76. 76. Bloom Filter: Set.contains(key): Boolean. “Hash Multiple Times and Flip the Bits Wherever You Land”
  77. 77. Bloom Filter: approximate Set.contains(key); No means No, Yes means Maybe; elements can only be added, never updated or removed
  78. 78. Bloom Filter in Action (images by @avibryant): set(key); contains(key): Boolean. Set.contains(key): TRUE -> maybe contains (other key hashes may overlap). Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)
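The "hash multiple times and flip the bits wherever you land" description can be sketched as a small class. This is an illustrative toy, not Algebird's BloomFilter: the class layout, the default sizes, and the use of seeded SHA-256 digests as the multiple hash functions are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Simulate several hash functions by prefixing the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        # Flip every bit the key hashes to. Bits are never cleared.
        for position in self._positions(key):
            self.bits[position] = True

    def contains(self, key):
        # False: definitely absent. True: maybe present (bits may overlap with other keys).
        return all(self.bits[position] for position in self._positions(key))

seen = BloomFilter()
seen.add("Top Gun")
print(seen.contains("Top Gun"))  # True
print(seen.contains("Taps"))     # almost certainly False, but a false positive is possible
```

Memory use is fixed by `num_bits` no matter how many keys are added; adding more keys only raises the false-positive rate.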
  79. 79. CountMin Sketch: Frequency Count and TopK. “Hash Multiple Times and Add 1 Wherever You Land”
  80. 80. CountMin Sketch (CMS): approximate frequency count and TopK for a key, i.e. “Heavy Hitters” on Twitter (figure labels: Matei Zaharia, Martin Odersky, Donald Trump)
  81. 81. CountMin Sketch In Action (TopK Count) (images derived from @avibryant). Use case: TopK movies using total views. add(Top Gun, 2), add(A Few Good Men, 1), add(Taps, 1): multiple hash functions (1 hash function per row); binary hash output (1 element per column); x 2 occurrences of “Top Gun” for slightly additional complexity. getCount(Top Gun): Long finds the minimum of all rows. Counters can overlap (figure labels: Overlap Top Gun, Overlap A Few Good Men), so the sketch can overestimate, but never underestimate
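The add/getCount walkthrough on slide 81 can be sketched as follows. As with the Bloom filter, this is a toy and not Algebird's implementation: the class, its default depth and width, and the seeded SHA-256 hashing are assumptions for the example.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth  # number of rows, one hash function per row
        self.width = width  # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows is the
        # tightest estimate and can never fall below the true count.
        return min(self.table[row][self._column(row, key)] for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))  # at least 2
```

A TopK list can be maintained alongside the sketch by keeping the k keys with the largest estimated counts.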
  82. 82. HyperLogLog: Count Distinct. “Hash Multiple Times and Uniformly Distribute Where You Land”
  83. 83. HyperLogLog (HLL): approximate count distinct. Slight twist: a special hash function creates a uniform distribution; hash subsets of data with a single, special hash function. Error estimate: 14 bits for size of range; m = 2^14 = 16,384 hash slots; error = 1.04 / sqrt(16,384) = 0.81% (figure annotation: “Not many of these”)
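The error estimate on slide 83 is simple arithmetic, worked here as a small function so the size/error tradeoff is visible for other register counts. The function name and the extra bit widths in the loop are assumptions; the formula and the 14-bit case come from the slide.

```python
import math

def hll_standard_error(index_bits):
    # m = 2^bits hash slots; error = 1.04 / sqrt(m), as on the slide.
    slots = 2 ** index_bits
    return 1.04 / math.sqrt(slots)

for bits in (10, 12, 14, 16):
    print(bits, 2 ** bits, f"{hll_standard_error(bits):.2%}")
```

For 14 bits this gives 1.04 / 128 = 0.008125, which the slide rounds to 0.81%. Each additional 2 bits quadruples the number of slots and halves the error.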
  84. 84. HyperLogLog In Action (Count Distinct). Use case: number of distinct users who view a movie. Uniform distribution: estimate the distinct # of users by inspecting just the beginning. Top Gun: Hour 1 (scale 0-16): user 1001, user 2009, user 3005, user 3003, user 3001, user 7009. Top Gun: Hour 2 (scale 0-32): user 2001, user 4009, user 3002, user 7002, user 1005, user 6001, user 8001, user 8002. Top Gun: Hour 1 + 2 (scale 0-32): combine across different scales
  85. 85. Locality Sensitive Hashing: Set Similarity. “Pre-process Items into Buckets, Compare Within Buckets”
  86. 86. Locality Sensitive Hashing (LSH): approximate set similarity. Pre-process m rows into b buckets (b << m; b = buckets, m = rows). Hash items multiple times: similar items hash to overlapping buckets; the hash is designed to cluster similar items. Compare just the contents of buckets: a much smaller Cartesian compare, done in parallel; avoids the huge Cartesian all-pairs compare (reference label: Chapter 3: LSH)
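One common way to realize the bucketing described on slide 86 is MinHash signatures split into bands, sketched below. The slide does not spell out the scheme used in the demo, so the banding approach, the function names, the seeded SHA-256 hashing, and the sample tag sets are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def min_hash_signature(items, num_hashes=8):
    # For each seeded hash function, keep the smallest hash value over the set.
    # Two sets agree on one signature position with probability equal to their Jaccard similarity.
    return [
        min(int(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]

def candidate_pairs(tag_sets, num_hashes=8, rows_per_band=2):
    buckets = defaultdict(set)
    for name, tags in tag_sets.items():
        signature = min_hash_signature(tags, num_hashes)
        # Split the signature into bands; each band is hashed to its own bucket.
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            buckets[(start, band)].add(name)
    # Only items that share at least one bucket are compared further.
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

# Hypothetical movies and tags.
movies = {
    "Top Gun": {"jets", "navy", "action"},
    "Iron Eagle": {"jets", "navy", "action"},
    "Message in a Bottle": {"romance", "letters", "sea"},
}
print(candidate_pairs(movies))
```

Only the candidate pairs need an exact Jaccard computation, which replaces the all-pairs Cartesian compare with a much smaller one. The sketch assumes every tag set is non-empty.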
  87. 87. DIMSUM: Set Similarity. “Pre-process and ignore data that is unlikely to be similar.”
  88. 88. DIMSUM (“Dimension Independent Matrix Square Using MR”): remove vectors with low probability of similarity; RowMatrix.columnSimilarities(threshold). Twitter DIMSUM case study: 40% efficiency gain over brute-force cosine similarity
  89. 89. Common Tools to Approximate: Twitter Algebird (composable library); Redis (distributed cache); Apache Spark (big data processing)
  90. 90. Twitter Algebird. Algebraic fundamentals: parallel, associative, composable. Examples: Min, Max, Avg; BloomFilter (Set.contains(key)); HyperLogLog (Count Distinct); CountMin Sketch (TopK Count)
  91. 91. Redis Implementation of HyperLogLog (Count Distinct): 12KB per item count; 2^64 max # of items; 0.81% error (slide annotations: ignore duplicates; Tunable)
      Add user views for given movie:
        PFADD TopGun_Hour1_HLL user1001 user2009 user3005
        PFADD TopGun_Hour1_HLL user3003 user1001
      Get distinct count (cardinality) of set:
        PFCOUNT TopGun_Hour1_HLL
        Returns: 4 (distinct users viewed this movie)
      Union 2 HyperLogLog data structures:
        PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL
  92. 92. Approximations in Spark Libraries
      Spark Core: countByKeyApprox(timeout: Long, confidence: Double); PartialResult
      Spark SQL: approxCountDistinct(column: Column, targetResidual: Float); approxQuantile(column: Column, quantiles: Seq[Float], targetResidual: Float)
      Spark ML: stratified sampling with sampleByKey(fractions: Map[K, Double]); DIMSUM sampling (probabilistic sampling reduces amount of shuffle) with RowMatrix.columnSimilarities(threshold: Double)
  93. 93. Demo! Exact Count vs. Approximate HLL and CMS Count
  94. 94. HashSet vs. HyperLogLog (Memory)
  95. 95. HashSet vs. CountMin Sketch (Memory)
  96. 96. Demo! Exact Similarity vs. Approximate LSH Similarity
  97. 97. Brute Force Cartesian All Pair Similarity: 47 seconds
  98. 98. Locality Sensitive Hash All Pair Similarity: 6 seconds
  99. 99. Many More Demos! Download Docker or clone on Github: http://advancedspark.com
  100. 100. Presentation Outline: ① Scaling ② Similarities ③ Recommendations ④ Approximations ⑤ Netflix Recommendations
  101. 101. Netflix Recommendations: From Ratings to Real-time
  102. 102. Netflix Has a Lot of Data. Netflix has a lot of data about a lot of users and a lot of movies. Netflix can use this data to buy new movies. Netflix is global: the UK doesn’t have White Castle, so my favorite movie, “Harold and Kumar Go to White Castle”, was renamed “Harold and Kumar Get the Munchies” (this broke my unit tests!). Netflix can use this data to choose original programming. Netflix knows that a lot of people like politics and Kevin Spacey. Summary: Buy NFLX Stock!
  103. 103. Netflix Data Pipeline - Then (v1.0, v2.0)
  104. 104. Netflix Data Pipeline - Now (Keystone, v3.0): 9 million events per second; 22 GB per second. EC2 D2XL: disk 6 TB at 475 MB/s; RAM 30 GB; network 700 Mbps. Auto-scaling, fault tolerance. A/B tests, Trending Now. Samza splits high and normal priority
  105. 105. Netflix Recommendation Data Pipeline: throw away batch user factors (U); keep batch video factors (V)
  106. 106. Netflix Trending Now (Time-based Recs): uses Spark Streaming; personalized to user (viewing history, past ratings); learns and adapts to events (Valentine’s Day). “VHS”: number of plays, number of impressions, calculate take rate
  107. 107. Bonus: Pandora Time-based Recs. Work days: play familiar music; user is less likely to accept new music. Evenings and weekends: play new music; user is more likely to accept new music
  108. 108. $1 Million Netflix Prize (2006-2009). Goal: improve movie predictions by 10% (Root Mean Squared Error); test data withheld to calculate RMSE upon submission. 5-star ratings dataset: (userId, movieId, rating, timestamp). Winning algorithm(s): 10.06% improvement (RMSE); ensemble of 500+ ML models combined with GBDTs; computationally impractical
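The prize metric on slide 108 is straightforward to compute. The sketch below shows RMSE and the relative improvement over a baseline; the function names and the sample ratings and RMSE values are hypothetical and are not the contest's actual numbers.

```python
import math

def rmse(predicted, actual):
    # Root mean squared error between predicted and held-out ratings.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def improvement(baseline_rmse, new_rmse):
    # Relative RMSE reduction; the prize required at least 0.10 (10%).
    return (baseline_rmse - new_rmse) / baseline_rmse

held_out = [4.0, 2.0, 5.0, 3.0]          # hypothetical withheld ratings
baseline = [3.0, 3.0, 4.0, 4.0]          # hypothetical baseline predictions
candidate = [3.5, 2.5, 4.5, 3.5]         # hypothetical improved predictions
print(rmse(baseline, held_out), rmse(candidate, held_out))
print(f"{improvement(rmse(baseline, held_out), rmse(candidate, held_out)):.1%}")
```

Because errors are squared before averaging, a few badly wrong predictions hurt RMSE more than many slightly wrong ones.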
  109. 109. Secrets to the Winning Algorithms: adjust for the following human biases… ① Alice effect: user rates lower than avg ② Inception effect: movie rated higher than avg ③ Overall mean rating of a movie ④ Number of people who have rated a movie ⑤ Number of days since user’s first rating ⑥ Number of days since movie’s first rating ⑦ Mood, time of day, day of week, season, weather
  110. 110. Netflix Common ML Algorithms: Logistic Regression; Linear Regression; Gradient Boosted Decision Trees; Random Forest; Matrix Factorization; SVD; Restricted Boltzmann Machines; Deep Neural Nets; Markov Models; LDA; Clustering. Ensembles!
  111. 111. Netflix Genres and Clusters. Typical genres: Documentaries, Romance Comedies, Horror, Action, Adventure. Latent (hidden) clusters: Emotionally-Independent Dramas for Hopeless Romantics; Witty Dysfunctional-Family TV Animated Comedies; Romantic Crime Movies based on Classic Literature; Latin American Forbidden-Love Movies; Critically-acclaimed Emotional Drug Movie; Cerebral Military Movie based on Real Life; Sentimental Movies about Horses for Ages 11-12; Gory Canadian Revenge Movies; Raunchy Mad Scientist Comedy
  112. 112. Netflix Social Integration: post to Facebook after movie start (5 mins); recommend to new users based on friends; helps with the Cold Start problem
  113. 113. Netflix Search. No results? No problem… show similar results! Utilize extensive DVD catalog; metadata search (ElasticSearch); named entity recognition (NLP). Empty searches are an opportunity: explicit feedback for future recommendations; content to buy and produce!
  114. 114. Netflix A/B Tests: users tend to click on images featuring… faces with strong emotional expressions; villains over heroes; a small number of cast members
  115. 115. Netflix Recommendation Serving Layer. Use case: recommendation service depends on EVCache. Problem: EVCache cluster goes down or becomes latent!? Answer: github.com/Netflix/Hystrix Circuit Breaker! Circuit states: Closed: service OK; Open: service DOWN, fall back to static
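The circuit-breaker pattern on slide 115 can be sketched generically. This is not Hystrix's API (Hystrix is a Java library); the class, its parameters, and the two stand-in functions `fetch_from_cache` and `static_recommendations` are hypothetical names used only to show the Closed and Open states and the static fallback.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # After the timeout, let one trial call through to probe the service.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()  # Open: skip the failing service entirely
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result

def fetch_from_cache():
    # Hypothetical stand-in for a cache lookup that is currently failing.
    raise ConnectionError("cache cluster unavailable")

def static_recommendations():
    # Hypothetical non-personalized fallback list.
    return ["Top Gun", "A Few Good Men", "Taps"]

breaker = CircuitBreaker(failure_threshold=2)
for _ in range(3):
    print(breaker.call(fetch_from_cache, static_recommendations), breaker.is_open)
```

Once the circuit is open, callers get the static list immediately instead of waiting on a cache that is down or slow, which keeps the recommendation service responsive.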
  116. 116. Why Higher Average Ratings 2004+? In 2004, Netflix noticed higher ratings on average. Some possible reasons why… ① Significant UI improvements deployed ② New recommendation engine deployed ③
  117. 117. Thank You, Everyone!! Chris Fregly (@cfregly), Research Scientist @ Flux Capacitor AI, San Francisco, California, USA. http://fluxcapacitor.com. Sign up for the Meetup and Book. Contribute to the Github repo. Run all demos using Docker. Find me on LinkedIn, Twitter, Github, Email, Fax. (Image derived from http://www.duchess-france.org/)