How to Determine which Algorithms Really Matter

1. Which Algorithms Really Matter?
    Hadoop Summit 2014
2. Me, Us
    • Ted Dunning, Chief Application Architect, MapR
      – Committer and PMC member: Mahout, ZooKeeper, Drill
      – Bought the beer at the first HUG
    • MapR
      – Distributes more open source components for Hadoop
      – Adds major technology for performance, HA, and industry-standard APIs
    • Info
      – Hash tag: #mapr
      – See also: @ApacheMahout, @ApacheDrill, @ted_dunning, and @mapR
3. Topic For Today
    • What is important? What is not?
    • Why?
    • What is the difference from academic research?
    • Some examples
4. What is Important?
    • Deployable
    • Robust
    • Transparent
    • Skillset and mindset matched?
    • Proportionate
5. What is Important?
    • Deployable
      – Clever prototypes don’t count if they can’t be standardized
    • Robust
    • Transparent
    • Skillset and mindset matched?
    • Proportionate
6. What is Important?
    • Deployable
      – Clever prototypes don’t count
    • Robust
      – Mishandling is common
    • Transparent
      – Will degradation be obvious?
    • Skillset and mindset matched?
    • Proportionate
7. What is Important?
    • Deployable
      – Clever prototypes don’t count
    • Robust
      – Mishandling is common
    • Transparent
      – Will degradation be obvious?
    • Skillset and mindset matched?
      – How long will your fancy data scientist enjoy doing standard ops tasks?
    • Proportionate
      – Where is the highest value per minute of effort?
8. Academic Goals vs Pragmatics
    • Academic goals
      – Reproducible
      – Isolate theoretically important aspects
      – Work on novel problems
    • Pragmatics
      – Highest net value
      – Available data is constantly changing
      – Diligence and consistency have larger impact than cleverness
      – Many systems feed themselves; exploration and exploitation are both important
      – Engineering constraints on budget and schedule
9. Example 1: Making Recommendations Better
10. Recommendation Advances
    • What are the most important algorithmic advances in recommendations over the last 10 years?
    • Co-occurrence analysis?
    • Matrix completion via factorization?
    • Latent factor log-linear models?
    • Temporal dynamics?
11. The Winner – None of the Above
    • What are the most important algorithmic advances in recommendations over the last 10 years?
      1. Result dithering
      2. Anti-flood
12. The Real Issues
    • Exploration
    • Diversity
    • Speed
    • Not the last fraction of a percent
13. Result Dithering
    • Dithering is used to re-order recommendation results
      – Re-ordering is done randomly
    • Dithering is guaranteed to make off-line performance worse
    • Dithering also has a near perfect record of making actual performance much better
14. Result Dithering
    • Dithering is used to re-order recommendation results
      – Re-ordering is done randomly
    • Dithering is guaranteed to make off-line performance worse
    • Dithering also has a near perfect record of making actual performance much better
    • “Made more difference than any other change”
15. Simple Dithering Algorithm
    • Generate a synthetic score from log rank plus Gaussian noise:
      s = log r + N(0, ε)
    • Pick the noise scale ε to provide the desired level of mixing:
      Δr ∝ r exp(ε)
    • Typically ε ∈ [0.4, 0.8]
    • Oh… use floor(t/T) as the random seed, so the ordering stays stable within each time window of length T
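A minimal Python sketch of this dithering scheme. The result list, the ε value, and the window length are illustrative assumptions; the talk only specifies the score formula, the typical ε range, and the floor(t/T) seed trick.

    import math
    import random
    import time

    def dither(ranked_items, epsilon=0.5, window_seconds=600):
        """Randomly re-order an already-ranked result list using log rank plus Gaussian noise."""
        # floor(t/T) as seed: the shuffle is stable within a time window, fresh in the next one.
        rng = random.Random(math.floor(time.time() / window_seconds))
        scored = []
        for rank, item in enumerate(ranked_items, start=1):
            synthetic_score = math.log(rank) + rng.gauss(0, epsilon)
            scored.append((synthetic_score, item))
        # Smaller synthetic score is better, since rank 1 has the smallest log rank.
        return [item for _, item in sorted(scored, key=lambda pair: pair[0])]

    # Illustrative call: the head of the list is mostly preserved, the tail gets mixed.
    print(dither(list(range(1, 9)), epsilon=0.5))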
16. Example … ε = 0.5 (each row shows which original ranks land on an 8-result page after one dithered draw)
     1  2  6  5  3  4 13 16
     1  2  3  8  5  7  6 34
     1  4  3  2  6  7 11 10
     1  2  4  3 15  7 13 19
     1  6  2  3  4 16  9  5
     1  2  3  5 24  7 17 13
     1  2  3  4  6 12  5 14
     2  1  3  5  7  6  4 17
     4  1  2  7  3  9  8  5
     2  1  5  3  4  7 13  6
     3  1  5  4  2  7  8  6
     2  1  3  4  7 12 17 16
17. Example … ε = log 2 = 0.69
     1  2  8  3  9 15  7  6
     1  8 14 15  3  2 22 10
     1  3  8  2 10  5  7  4
     1  2 10  7  3  8  6 14
     1  5 33 15  2  9 11 29
     1  2  7  3  5  4 19  6
     1  3  5 23  9  7  4  2
     2  4 11  8  3  1 44  9
     2  3  1  4  6  7  8 33
     3  4  1  2 10 11 15 14
    11  1  2  4  5  7  3 14
     1  8  7  3 22 11  2 33
18. Exploring The Second Page
19. Lesson 1: Exploration is good
20. Example 2: Bayesian Bandits
21. Bayesian Bandits
    • Based on Thompson sampling
    • Very general sequential test
    • Near-optimal regret
    • Trades off exploration and exploitation
    • Possibly the best known solution for exploration/exploitation
    • Incredibly simple
22. Thompson Sampling
    • Select each shell according to the probability that it is the best
    • The probability that it is the best can be computed from the posterior:
      P(i is best) = ∫ I[ E[r_i | θ] = max_j E[r_j | θ] ] P(θ | D) dθ
    • But I promised a simple answer
23. Thompson Sampling – Take 2
    • Sample θ from the posterior:  θ ~ P(θ | D)
    • Pick i to maximize the expected reward:  i = argmax_j E[r_j | θ]
    • Record the result from using i
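A minimal Python sketch of this sample-pick-record loop for Bernoulli rewards with Beta posteriors. The number of arms, the Beta(1, 1) prior, and the simulated click-through rates are made-up illustration values, not from the talk.

    import random

    true_rates = [0.04, 0.05, 0.02]      # unknown in real life; used here only to simulate feedback
    successes = [0] * len(true_rates)    # per-arm observed successes
    failures = [0] * len(true_rates)     # per-arm observed failures

    for _ in range(10000):
        # Sample θ from each arm's Beta posterior and pick the arm whose sample is largest.
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(len(true_rates))]
        i = samples.index(max(samples))

        # Use arm i and record the result, which sharpens that arm's posterior.
        if random.random() < true_rates[i]:
            successes[i] += 1
        else:
            failures[i] += 1

    print(successes, failures)   # most trials concentrate on the best arm

Arms that look promising get sampled more often, but every arm keeps a non-zero chance of being tried as long as its posterior overlaps the leader, which is where the exploration comes from.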
24. Fast Convergence
    [Figure: regret vs. number of trials n (0 to 1000), comparing ε-greedy with ε = 0.05 against a Bayesian Bandit with a Gamma-Normal model.]
25. Thompson Sampling on Ads
    “An Empirical Evaluation of Thompson Sampling,” Chapelle and Li, 2011
26. Bayesian Bandits versus Result Dithering
    • Many useful systems are difficult to frame in fully Bayesian form
    • Thompson sampling cannot be applied without posterior sampling
    • Can still do useful exploration with dithering
    • But better to use Thompson sampling if possible
27. Lesson 2: Exploration is pretty easy to do and pays big benefits.
28. Example 3: On-line Clustering
29. The Problem
    • k-means clustering is useful for feature extraction or compression
    • At scale and at high dimension, the desirable number of clusters increases
    • A very large number of clusters may require more passes through the data
    • Super-linear scaling is generally infeasible
30. The Solution
    • Sketch-based algorithms produce a sketch of the data
    • Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution
    • The size of the sketch grows very slowly with increasing data size
    • Many operations such as clustering are well behaved on sketches
    References:
      – “Fast and Accurate k-means For Large Datasets,” Michael Shindler, Alex Wong, Adam Meyerson
      – “Revisiting k-means: New Algorithms via Bayesian Nonparametrics,” Brian Kulis, Michael Jordan
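A minimal one-pass Python sketch of the dp-means idea behind this: keep weighted centroids instead of raw points, and start a new centroid only when a point is far from all existing ones. The threshold-growth rule, the 2-D points, and the size limit are illustrative assumptions; a real streaming k-means adds approximate search and re-clusters the sketch when it grows too large.

    import math
    import random

    def dp_means_sketch(points, threshold=1.0, max_centroids=1000):
        """One pass over the data, producing weighted centroids that approximate it."""
        centroids = []   # each entry is [x, y, weight]
        for x, y in points:
            if centroids:
                nearest = min(centroids, key=lambda c: (c[0] - x) ** 2 + (c[1] - y) ** 2)
                d = math.hypot(nearest[0] - x, nearest[1] - y)
            else:
                nearest, d = None, float("inf")
            if d > threshold:
                centroids.append([x, y, 1.0])          # far from everything: new centroid
                if len(centroids) > max_centroids:
                    threshold *= 1.5                    # simplistic stand-in for re-sketching
            else:
                w = nearest[2]                          # fold the point into the nearest centroid
                nearest[0] = (nearest[0] * w + x) / (w + 1)
                nearest[1] = (nearest[1] * w + y) / (w + 1)
                nearest[2] = w + 1
        return centroids

    data = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(10000)]
    print(len(dp_means_sketch(data)))   # a handful of centroids summarize 30,000 points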
31. An Example
32. An Example
33. The Cluster Proximity Features
    • Every point can be described by the nearest cluster
      – 4.3 bits per point in this case
      – Significant error that can be decreased (to a point) by increasing the number of clusters
    • Or by the proximity to the 2 nearest clusters (2 × 4.3 bits + 1 sign bit + 2 proximities)
      – Error is negligible
      – Unwinds the data into a simple representation
    • Or we can increase the number of clusters (an n-fold increase adds log n bits per point and decreases error by sqrt(n))
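A minimal Python sketch of the two-nearest-cluster feature described above. The exact feature layout (two centroid ids plus raw distances) is an illustrative assumption; the centroids would come from the sketching step.

    import math

    def proximity_features(point, centroids):
        """Describe a point by its two nearest centroids and its distance to each."""
        # The centroid ids play the role of the ~4.3-bit cluster codes;
        # the two distances carry the residual information about the point.
        dists = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
        (d1, i1), (d2, i2) = dists[0], dists[1]
        return i1, i2, d1, d2

    centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    print(proximity_features((0.4, 4.4), centroids))   # nearest: cluster 2, then cluster 0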
34. Diagonalized Cluster Proximity
35. Lots of Clusters Are Fine
36. Typical k-means Failure
    Selecting two seeds here cannot be fixed with Lloyd’s algorithm; the result is that these two clusters get glued together.
37. Streaming k-means Ideas
    • By using a sketch with lots (k log N) of centroids, we avoid pathological cases
    • We still get a very good result if the sketch is created
      – in one pass
      – with approximate search
    • In fact, adaptive dp-means works just fine
    • In the end, the sketch can be used for clustering or …
38. Lesson 3: Sketches make big data small.
39. Example 4: Search Abuse
40. Recommendation
    • Alice got an apple and a puppy
    • Charles got a bicycle
    • Bob got an apple
41. Recommendation
    • Alice got an apple and a puppy
    • Charles got a bicycle
    • Bob got an apple. What else would Bob like?
42. Recommendation
    • Alice got an apple and a puppy
    • Charles got a bicycle
    • Bob: A puppy!
43. History Matrix: Users × Items
    [Table: rows are Alice, Bob, and Charles; columns are the items; a ✔ marks each item in a user’s history.]
44. Co-Occurrence Matrix: Items × Items
    [Table: for each pair of items, the count of users whose histories contain both.]
    • Use the LLR test to turn raw co-occurrence counts into indicators of interesting co-occurrence
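A minimal Python version of the log-likelihood ratio (G²) test used to score one item pair from its 2×2 contingency table. The counts in the example calls are made up for illustration; in practice they come from the history matrix above (k11 = users with both items, k12 and k21 = users with only one of them, k22 = users with neither).

    import math

    def llr(k11, k12, k21, k22):
        """Log-likelihood ratio test for a 2x2 co-occurrence table; larger = more anomalous."""
        def x_log_x(x):
            return x * math.log(x) if x > 0 else 0.0

        n = k11 + k12 + k21 + k22
        row = x_log_x(k11 + k12) + x_log_x(k21 + k22)
        col = x_log_x(k11 + k21) + x_log_x(k12 + k22)
        cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
        return 2.0 * (x_log_x(n) + cells - row - col)

    # A pair that co-occurs far more often than chance predicts gets a large score ...
    print(llr(k11=25, k12=75, k21=50, k22=10000))       # strongly anomalous
    # ... while a pair whose overlap is about what popularity alone predicts scores near zero.
    print(llr(k11=20, k12=980, k21=1980, k22=100000))

Item pairs whose score clears a threshold become the ✔ entries of the indicator matrix on the next slide.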
45. Indicator Matrix: Anomalous Co-Occurrence
    [Matrix with ✔ marks on the item pairs whose co-occurrence passed the LLR test.]
46. Co-occurrence Binary Matrix
    [2×2 table for one item pair: each user history either contains an item (1) or does not (not); the four resulting counts are the input to the LLR test.]
47. Indicator Matrix: Anomalous Co-Occurrence
    [Same indicator matrix, with one item’s row highlighted.]
    Result: the marked row will be added to the indicator field in the item document…
48. Indicator Matrix
    id: t4
    title: puppy
    desc: The sweetest little puppy ever.
    keywords: puppy, dog, pet
    indicators: (t1)
    That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.
    Note: data for the indicator field is added directly to the metadata of a document in the Solr index; you don’t need to create a separate index for the indicators.
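A minimal Python sketch of indexing that item with its indicator field. The field and item names follow the slide; the Solr URL and the “items” collection name are assumptions.

    import requests

    doc = {
        "id": "t4",
        "title": "puppy",
        "desc": "The sweetest little puppy ever.",
        "keywords": ["puppy", "dog", "pet"],
        "indicators": ["t1"],   # the LLR-filtered row of the indicator matrix for this item
    }

    # Post the document to a (hypothetical) "items" collection; no separate index is needed.
    requests.post("http://localhost:8983/solr/items/update?commit=true", json=[doc])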
49. Internals of the Recommender Engine
50. Internals of the Recommender Engine
51. Looking Inside LucidWorks
    • Real-time recommendation query and results
    • What to recommend if a new user listened to 2122: Fats Domino and 303: Beatles?
    • Recommendation is “1710: Chuck Berry”
    • Evaluation
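A minimal Python sketch of that real-time query: the new user’s history becomes a search against the indicator field, and the ranked hits are the recommendations. The host, collection name, and field name mirror the assumptions in the indexing sketch above.

    import requests

    history = ["2122", "303"]   # Fats Domino, Beatles

    params = {
        "q": "indicators:(%s)" % " ".join(history),   # e.g. indicators:(2122 303)
        "fl": "id,title,score",
        "rows": 10,
        "wt": "json",
    }
    response = requests.get("http://localhost:8983/solr/items/select", params=params)
    for doc in response.json()["response"]["docs"]:
        print(doc["id"], doc.get("title"), doc["score"])

Because the recommendation is just a search query, it inherits the search engine’s speed, scaling, and operational tooling for free, which is the “search abuse” the final lesson refers to.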
52. Real-life example
53. Lesson 4: Recursive search abuse pays
    • Search can implement recommendations
    • Recommendations can implement search
54. Summary
55. Me, Us
    • Ted Dunning, Chief Application Architect, MapR
      – Committer and PMC member: Mahout, ZooKeeper, Drill
      – Bought the beer at the first HUG
    • MapR
      – Distributes more open source components for Hadoop
      – Adds major technology for performance, HA, and industry-standard APIs
    • Info
      – Hash tag: #mapr
      – See also: @ApacheMahout, @ApacheDrill, @ted_dunning, and @mapR
