Real-time, Advanced Analytics and Recommendations using Machine Learning, Natural Language Processing, Graph Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird

Agenda

Intro

Live, Interactive Recommendations Demo

Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker

Types of Similarity

Euclidean vs. Non-Euclidean Similarity

User-to-User Similarity

Content-based, Item-to-Item Similarity (Amazon)

Collaborative-based, User-to-Item Similarity (Netflix)

Graph-based, Item-to-Item Similarity Pathway (Spotify)

Similarity Approximations at Scale

Twitter Algebird

MinHash and Bucketing

Locality Sensitive Hashing (LSH)

Netflix Recommendations: From Ratings to Real-Time

DVD-Ratings-based $1M Netflix Prize (2009)

Streaming-based "Trending Now" (2016)

Wrap Up

Q & A

*Bio*

Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.

*Related Links*

https://github.com/fluxcapacitor/pipeline/wiki

http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf

http://static.echonest.com/BoilTheFrog/

http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/

http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf

Published in:
Software


- 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark (spark.tc). Spark and Recommendations: Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP. USF Seminar Series (Thanks, USF!!), Feb 5th, 2016. Chris Fregly, Principal Data Solutions Engineer. We’re Hiring! (Only Nice People) advancedspark.com
- 2. Who Am I? Streaming Data Engineer at Netflix and Netflix OSS Committer; Data Solutions Engineer at Databricks and Apache Spark Contributor; Principal Data Solutions Engineer at the IBM Spark Technology Center; Meetup Organizer, Advanced Apache Spark Meetup; Book Author, Advanced Spark (due 2016).
- 3. Advanced Apache Spark Meetup (http://advancedspark.com). Meetup Metrics: Top 5 most-active Spark Meetup! 2400+ members in just 6 mos!! 2500+ Docker image downloads. Meetup Mission: deep-dive into Spark and related open source projects; surface key patterns and idioms; focus on distributed systems, scale, and performance.
- 4. Live, Interactive Demo!! Audience participation required (cell phone or laptop).
- 5. demo.advancedspark.com pipeline diagram. Top row: End User -> ElasticSearch -> Spark ML -> Data Scientist ->. Bottom row: <- Kafka <- Spark Streaming <- Cassandra, Redis <- Zeppelin, iPython.
- 6. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools.
- 7. Scaling with Parallelism. [Figure; surviving labels: Peter, O(log n), O(log n)]
- 8. Scaling with Composability. Max: (a max b max c max d) == (a max b) max (c max d). Set Union: (a U b U c U d) == (a U b) U (c U d). Addition: (a + b + c + d) == (a + b) + (c + d). Multiply: (a * b * c * d) == (a * b) * (c * d). Division??
- 9. What about Division? (a / b / c / d) != (a / b) / (c / d). Example: (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8), because (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7)), i.e. 0.0134 != 0.857. Not composable. “Divide like an Egyptian”: what were the Egyptians thinking?!
- 10. What about Average? Represent each input as a [value, count] pair: [3, 1], [5, 1], [5, 1], [7, 1]. Overall AVG: ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5, which matches AVG(3, 5, 5, 7) == 5. Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8/2 + 12/2 == 20/2 == 10 != 5. Divide, add, divide? Not composable. Add, add, add? Composable! A single divide at the end doesn’t need to be composable.
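
The composability argument of slides 8 to 10 can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the helper names `combine` and `finish` are made up for this example. Partial results are kept as (sum, count) pairs, which merge associatively, and the single divide happens once at the end.

```python
from functools import reduce

def combine(a, b):
    """Associative merge of two (sum, count) partial aggregates."""
    return (a[0] + b[0], a[1] + b[1])

def finish(agg):
    """The single divide at the end."""
    total, count = agg
    return total / count

values = [3, 5, 5, 7]
pairs = [(v, 1) for v in values]

# Grouping does not matter: (a + b) + (c + d) equals ((a + b) + c) + d.
left = combine(combine(pairs[0], pairs[1]), combine(pairs[2], pairs[3]))
sequential = reduce(combine, pairs)
overall_avg = finish(left)

# Dividing early breaks composability: averaging the averages of two
# partitions of unequal size does not give the overall average.
avg_of_avgs = (finish(reduce(combine, pairs[:3])) + finish(pairs[3])) / 2
```

Because `combine` is associative, each partition can be reduced in parallel and the partial pairs merged in any grouping, which is the property the slides ask of an aggregation.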
- 11. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools.
- 12. Similarity
- 13. Euclidean Similarity: exists in Euclidean, flat space; based on Euclidean distance; linear measure; bias towards magnitude.
- 14. Cosine Similarity: angular measure; normalizes to unit vectors; adjusts for Euclidean magnitude bias.
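
To make the magnitude bias concrete, here is a small Python sketch (an illustration under assumed data, not from the talk; the two user vectors are invented). Two users with the same tastes but very different activity levels are far apart in Euclidean distance, yet their cosine similarity is 1 because only the angle between the vectors matters.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-genre viewing counts for two users.
light_user = [1.0, 2.0, 0.5]
heavy_user = [10.0, 20.0, 5.0]  # same proportions, 10x the activity

dist = euclidean_distance(light_user, heavy_user)
cos = cosine_similarity(light_user, heavy_user)
```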
- 15. Jaccard Similarity: set similarity measurement; set intersection / set union; based on Jaccard distance; bias towards popularity.
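
A minimal Python sketch of the Jaccard measure follows. The two viewing histories are invented for illustration, and treating two empty sets as fully similar is a convention chosen here, not something the slide specifies.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed for this sketch
    return len(a & b) / len(a | b)

# Hypothetical viewing histories.
alice = {"Top Gun", "Taps", "A Few Good Men"}
bob = {"Top Gun", "A Few Good Men", "Inception"}

sim = jaccard_similarity(alice, bob)  # 2 shared titles out of 4 distinct
dist = 1.0 - sim                      # Jaccard distance
```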
- 16. Log Likelihood Similarity: adjusts for popularity bias; the Netflix “Shawshank” problem.
- 17. Word Similarity: Edit Distance. Calculate character differences between words: deletes, transposes, replaces, inserts.
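
The slide lists four edit operations, including transposes. One common way to count all four is the optimal string alignment variant of edit distance, sketched below in Python. The talk does not say which variant it had in mind, so this choice is an assumption.

```python
def edit_distance(s, t):
    """Optimal string alignment distance: insert, delete, replace, transpose."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match when cost is 0)
            )
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]
```

For example, `edit_distance("spark", "sprak")` counts the swapped letters as a single transposition rather than two replacements.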
- 18. Document Similarity. TF/IDF: Term Frequency / Inverse Document Frequency; used by most search engines. Word2Vec: words embedded in a vector space, with similar words nearby.
- 19. Similarity Pathway: i.e. the closest recommendations between 2 people.
- 20. Calculating Similarity. Exact: brute-force “all-pairs similarity” (aka “pair-wise similarity”, “similarity join”); Cartesian O(n^2) shuffle and comparison. Approximate: sampling; bucketing (aka “partitioning”, “clustering”); remove data with low probability of similarity; reduce shuffle and comparisons.
- 21. Bonus: Document Summary. Text Rank (aka “Sentence Rank”): TF/IDF + Similarity Graph + PageRank. Intuition: surface summary sentences (abstract); most similar to all others (TF/IDF + Similarity Graph); most influential sentences (PageRank).
- 22. Similarity Graph: a vertex is a movie, tag, actor, plot summary, etc.; edges are relationships and weights.
- 23. Topic-Sensitive PageRank: graph diffusion algorithm; pre-process the graph, adding a vector of probabilities to each vertex: the probability of ending up at this vertex from every other vertex.
- 24. Recommendations
- 25. Basic Terminology. User: user seeking recommendations. Item: item being recommended. Explicit User Feedback: like or rating. Implicit User Feedback: search, click, hover, view, scroll. Instances: rows of user feedback/input data. Overfitting: training a model too closely to the training data & hyperparameters. Hold Out Split: holding out some of the instances to avoid overfitting. Features: columns of instance rows (of feedback/input data). Cold Start Problem: not enough data to personalize (e.g. a new user). Hyperparameter: model-specific config knobs for tuning (tree depth, iterations). Model Evaluation: compare predictions to actual values of the hold out split. Feature Engineering: modify, reduce, combine features.
- 26. Feature Engineering. Dimension Reduction: reduce the number of features (aka “feature space”). Principal Component Analysis (PCA): find the principal features that describe the data in terms of variance; peel the dimensional layers back until you describe the data. Example, One-Hot Encoding: convert categorical feature values to 0’s and 1’s, with 1 binary column per category; remove any hint of a relationship between the categories. Bears -> 1 -> [1,0,0]; 49’ers -> 2 -> [0,1,0]; Steelers -> 3 -> [0,0,1].
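
The one-hot example on slide 26 can be reproduced with a short Python sketch. The function name `one_hot_encode` is made up here, and columns are assigned in first-seen order so the output matches the slide's Bears, 49'ers, Steelers ordering.

```python
def one_hot_encode(values):
    """Map each categorical value to a vector with a single 1."""
    categories = list(dict.fromkeys(values))  # unique, first-seen order
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # one binary column per category
        encoded.append(row)
    return categories, encoded

teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, encoded = one_hot_encode(teams)
```

Unlike the integer codes 1, 2, 3, the resulting vectors are all equally far apart, so a model cannot read an ordering or a distance into the team names.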
- 27. Features. Binary Features: true or false. Numeric Discrete Features: integers. Numeric Features: real values. Ordinal Features: maintain order (S -> M -> L -> XL -> XXL). Temporal Features: time-based (time of day, binge watching). Categorical Features: finite, unique set of categories (NFL teams).
- 28. Non-Personalized Recommendations
- 29. “Cold Start” Problem: new user, don’t know their preferences, must show them something! Movies with highest-rated actors: Top K aggregations. Most desirable singles: PageRank of like activity. Facebook social graph: recommend friend activity.
- 30. Personalized Recommendations
- 31. Clustering (aka Nearest Neighbors). User-to-User Clustering: similar movies watched or rated; similar viewing pattern (i.e. binge or casual). Item-to-Item Clustering: similar tags/genres on movies; similar textual description (TF/IDF, Word2Vec, NLP, Image). See http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html. [Image captions: My OKCupid Profile; My Hinge Profile]
- 32. User-to-Item Collaborative Filtering: Matrix Factorization. ① Factor the large matrix (left) into 2 smaller matrices (right). ② Fill in the missing values in the large matrix.
- 33. Item-to-Item Collaborative Filtering: made famous by the Amazon paper (~2003). Problem: as the # of users grew, Matrix Factorization couldn’t scale. Solution: Offline/Batch, generate itemId -> List[customerId] vectors; Online/Real-time, for each item in the cart, recommend similar items from the vector space.
- 34. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools.
- 35. When to Approximate? Memory- or time-constrained queries; relative vs. exact counts are OK (# errors between then and now). Using machine learning or graph algos, which are inherently probabilistic and approximate: finding topics in documents (LDA); finding similar pairs of users, items, words at scale (LSH); finding top influencers (PageRank). Streaming aggregations (distinct count or top k): an inherently sloppy means of collecting (at-least-once delivery). Approximate as much as you can get away with! Ask for forgiveness later!!
- 36. When NOT to Approximate? If you’ve ever heard the term… “Sarbanes-Oxley” …in-that-order, at the office, after 2002.
- 37. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools.
- 38. A Few Good Algorithms: “You can’t handle the approximate!”
- 39. Common to These Algos & Data Structs: low, fixed size in memory; known error bounds; store large amounts of data; less memory than Java/Scala collections; tunable tradeoff between size and error; rely on multiple hash functions or operations; size of hash range defines error.
- 40. Bloom Filter: Set.contains(key): Boolean. “Hash Multiple Times and Flip the Bits Wherever You Land”
- 41. Bloom Filter: approximate set membership for a key. False positives are possible: the filter reports contains() for a key that was never added. Negatives are always true: when the filter reports !contains(), the key is definitely absent. Elements are only added, never removed.
- 42. Bloom Filter in Action (images by @avibryant): set(key); contains(key): Boolean. TRUE -> maybe contains; FALSE -> definitely does not contain.
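
“Hash multiple times and flip the bits wherever you land” can be sketched as follows. This is a toy Python version for illustration, not Algebird's implementation; the default sizes and the use of salted SHA-256 as the family of hash functions are assumptions of this sketch.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: fixed bit array plus several salted hashes."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per hash function (salt = hash function index).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def contains(self, key):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    bf.add(user)
```

Bits are only ever set, never cleared, which is why elements cannot be removed and why an added key can never be reported absent.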
- 43. CountMin Sketch: Frequency Count and TopK. “Hash Multiple Times and Add 1 Wherever You Land”
- 44. CountMin Sketch (CMS): approximate frequency count and TopK for a key, i.e. “Heavy Hitters” on Twitter (pictured: Johnny Hallyday, Martin Odersky, Donald Trump).
- 45. CountMin Sketch in Action (images derived from @avibryant). Use case: TopK movies using total views. Multiple hash functions (1 hash function per row); binary hash output (1 element per column). add(Top Gun, 2) records 2 occurrences of “Top Gun” (for slightly additional complexity); add(A Few Good Men, 1); add(Taps, 1). Keys can overlap in a cell (the figure shows overlaps for Top Gun and A Few Good Men). getCount(Top Gun): Long finds the minimum of all rows, so it can overestimate, but never underestimate.
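
The add and getCount steps from slide 45 can be sketched in Python as below. This is an illustrative toy, not Algebird's CMS; the table dimensions and the salted SHA-256 hashing are assumptions made for the example.

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min sketch: one row of counters per hash function."""

    def __init__(self, depth=4, width=256):
        self.depth = depth  # number of hash functions (rows)
        self.width = width  # counters per row (columns)
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        # Add the count wherever each row's hash lands.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

cms = CountMinSketch()
cms.add("Top Gun", 2)
cms.add("A Few Good Men", 1)
cms.add("Taps", 1)
```

A wider table lowers the chance that two movies share a counter, which is the size-versus-error tradeoff mentioned on slide 39.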
- 46. HyperLogLog: Count Distinct. “Hash Multiple Times and Uniformly Distribute Where You Land”
- 47. HyperLogLog (HLL): approximate count distinct. Slight twist: a special hash function creates a uniform distribution. Error estimate: with 14 bits for the size of the range, m = 2^14 = 16,384 slots, so error = 1.04 / sqrt(16,384) = 0.81%. [Figure annotation: “Not many of these”]
- 48. HyperLogLog in Action. Use case: distinct number of views per movie. [Figure: user IDs hashed onto number lines labeled 0 to 16 and 0 to 32, for “Top Gun: Hour 1” and “Top Gun: Hour 2”.] Uniform distribution: estimate the distinct # of users by inspecting just the beginning. Composable: Hour 1 + Hour 2 (lose a bit of precision).
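
The error formula and the “Hour 1 + Hour 2” merge can be illustrated with a compact Python sketch. It follows the textbook HyperLogLog estimator in simplified form (no large-range correction) and is not the Redis or Algebird implementation; the class layout, the SHA-256 hash, and the default of 2^10 registers are assumptions of this example.

```python
import hashlib
import math

def standard_error(m):
    """Relative standard error for m registers, as on slide 47."""
    return 1.04 / math.sqrt(m)

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p       # bits of the hash used to pick a register
        self.m = 1 << p  # number of registers ("slots")
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")   # 64-bit hash value
        idx = x >> (64 - self.p)                # first p bits: register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining bits
        # Position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: element-wise max of the registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

# Hypothetical viewers of one movie in two overlapping hours.
hour1, hour2 = HyperLogLog(), HyperLogLog()
for u in range(0, 3000):
    hour1.add(f"user{u}")
for u in range(2000, 5000):
    hour2.add(f"user{u}")
both_hours = hour1.merge(hour2)
estimate = both_hours.count()  # true distinct count is 5000
```

Merging by element-wise max yields exactly the registers a single sketch would hold had it seen both hours, so the sketch composes across time windows or partitions.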
- 49. Locality Sensitive Hashing: Set Similarity. “Pre-process Items into Buckets, Compare Within Buckets”
- 50. Locality Sensitive Hashing (LSH): approximate set similarity. The hash is designed to cluster similar items, which avoids the Cartesian all-pairs comparison. Pre-process m rows into b buckets, with b << m. Hash items multiple times; similar items hash to overlapping buckets. Compare just the contents of each bucket: a much smaller Cartesian … and parallel!!
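
One common way to realize “pre-process into buckets, compare within buckets” for set similarity is MinHash signatures with banding, which the agenda also mentions (“MinHash and Bucketing”). The Python sketch below is an illustration under assumptions: the function names, the 32 hashes, the 8 bands, and the sample users are all invented here, and every set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=32):
    """For each hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def lsh_candidate_pairs(signatures, bands=8):
    """Bucket by band of the signature; only bucket-mates become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

# Hypothetical viewing histories.
users = {
    "alice": {"Top Gun", "Taps", "A Few Good Men"},
    "bob": {"Top Gun", "Taps", "A Few Good Men"},
    "carol": {"Inception", "Interstellar", "Memento"},
}
signatures = {u: minhash_signature(movies) for u, movies in users.items()}
candidates = lsh_candidate_pairs(signatures)
```

Only the candidate pairs are then compared exactly (for example with Jaccard similarity), and each bucket can be processed independently, which is where the parallelism comes from.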
- 51. DIMSUM: Set Similarity. “Pre-process Items into Buckets, Compare Within Buckets”
- 52. DIMSUM: “Dimension Independent Matrix Square Using MR”. Remove vectors with low probability of similarity: RowMatrix.columnSimilarities(threshold). Twitter DIMSUM case study: 40% efficiency gain over brute-force cosine similarity.
- 53. Presentation Outline: Scaling with Parallelism and Composability; Similarity and Recommendations; When to Approximate; Common Algorithms and Data Structures; Common Libraries and Tools.
- 54. Common Tools to Approximate: Twitter Algebird (composable library); Redis (distributed cache); Apache Spark (big data processing).
- 55. Twitter Algebird: rooted in algebraic fundamentals! Parallel, associative, composable. Examples: Min, Max, Avg; BloomFilter (Set.contains(key)); HyperLogLog (Count Distinct); CountMin Sketch (TopK Count).
- 56. Redis: implementation of HyperLogLog (Count Distinct). 12 KB per count (one HyperLogLog key); 2^64 max # of items; 0.81% standard error. Add user views for a given movie (duplicates are ignored): PFADD TopGun_HLL user1001 user2009 user3005, then PFADD TopGun_HLL user3003 user1001. Get the distinct count (cardinality) of the set: PFCOUNT TopGun_HLL returns 4 (distinct users viewed this movie). Union 2 HyperLogLog data structures: PFMERGE TopGun_HLL Taps_HLL.
- 57. Spark Approximations. Spark Core: RDD.count*Approx(). Spark SQL: PartialResult; HyperLogLogPlus; approxCountDistinct(column). Spark ML: stratified sampling with PairRDD.sampleByKey(withReplacement, fractions: Map[K, Double]); DIMSUM sampling, where probabilistic sampling reduces the amount of comparison shuffle: RowMatrix.columnSimilarities(threshold). Spark Streaming: A/B testing with StreamingTest.setTestMethod("welch").registerStream(dstream).
- 58. Demos!
- 59. Counting: Exact Count vs. Approx HyperLogLog, CountMin Sketch
- 60. HashSet vs. HyperLogLog
- 61. HashSet vs. CountMin Sketch
- 62. Set Similarity: Exact Jaccard Similarity vs. Approx Locality Sensitive Hashing
- 63. Brute Force Cartesian All Pair Similarity: 90 mins!
- 64. All Pairs & Locality Sensitive Hashing: << 90 mins!
- 65. Many More Demos Available! http://advancedspark.com: download Docker or clone Github.
- 66. Bonus: Netflix Recommendations. From Offline DVD Ratings to Real-time Trending Now
- 67. $1 Million Netflix Prize (2006-2009). Goal: improve movie predictions by 10% (RMSE). Dataset: (userId, movieId, rating, timestamp); test data withheld to calculate RMSE upon submission. Winning algorithm: 10.06% improvement (RMSE); ensemble of 500+ ML models, combined using GBDTs; computationally impractical.
- 68. Secret to the Winning Algorithms: adjust for the following… Human bias: “Alice effect” (Alice tends to rate lower than the average user); “Inception effect” (Inception is rated higher than average); “Alice-Inception effect” (combo of Alice and Inception). Time-based bias: number of days since a user’s first rating; number of days since a movie’s first rating; number of people who have rated a movie; a movie’s overall mean rating.
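
The “Alice effect” and “Inception effect” correspond to per-user and per-movie offsets from the global mean rating. The Python sketch below fits such a baseline on a tiny invented dataset. It is a deliberately simplified, unregularized version for illustration and not the prize-winning method; the function names and ratings are made up.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, movie, rating) tuples.

    Returns (mu, user_bias, movie_bias): the global mean, each movie's
    average offset from it, and each user's average remaining offset.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    by_movie = defaultdict(list)
    for _, movie, r in ratings:
        by_movie[movie].append(r - mu)
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}

    by_user = defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - mu - movie_bias[movie])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}
    return mu, user_bias, movie_bias

def predict(mu, user_bias, movie_bias, user, movie):
    """Baseline prediction: global mean + user effect + movie effect."""
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

# Invented ratings: alice rates low, Inception rates high.
ratings = [
    ("alice", "Inception", 4), ("alice", "Taps", 2),
    ("bob", "Inception", 5), ("bob", "Taps", 4),
    ("carol", "Inception", 5), ("carol", "Taps", 4),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
alice_inception = predict(mu, user_bias, movie_bias, "alice", "Inception")
```

Unknown users or movies fall back to a bias of zero, so a cold-start prediction is simply the global mean plus whatever is known.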
- 69. Current Netflix Recommendations: throw away offline-generated user factors (U).
- 70. Netflix Common ML Algorithms: Logistic Regression, Linear Regression, Gradient Boosted Decision Trees, Random Forest, Matrix Factorization, SVD, Restricted Boltzmann Machines, Deep Neural Nets, Markov Models, LDA, Clustering, …
- 71. Bonus: Netflix Search. No results? No problem… show similar results! Used as implicit feedback for future decision making.
- 72. Netflix and Data. Netflix has a lot of data about a lot of users and a lot of movies. Netflix can use this data to buy new movies. Netflix can use this data to choose original programming. Netflix knows that a lot of people like Politics and Kevin Spacey. Netflix is global: the UK doesn’t have any White Castles, so they renamed my favorite movie, “Harold and Kumar Go to White Castle”, to “Harold and Kumar Get the Munchies”. (This broke all of my unit tests.) Summary: Buy NFLX Stock!
- 73. Thank You!! Chris Fregly (@cfregly), IBM Spark Tech Center (http://spark.tc), San Francisco, California, USA. http://advancedspark.com: sign up for the Meetup and Book; contribute to the Github repo; run all demos using Docker. Find me: LinkedIn, Twitter, Github, Email, Fax. (Image derived from http://www.duchess-france.org/)
- 74. advancedspark.com @cfregly
