
Buzz words 2012 - Real-time learning

This is the talk that I gave at Buzzwords. It is a heavily condensed version of the real-time learning talk I have given previously. The associated video is available at http://www.youtube.com/watch?v=1nQBnyaI6s0

Published in: Technology, Business

  1. Real-time and Long-time with Storm and Hadoop
  2. Real-time and Long-time with Storm and Hadoop – MapR
  3. The Challenge
     Hadoop is great at processing vats of data
     – But sucks for real-time (by design!)
     Storm is great for real-time processing
     – But lacks any way to deal with batch processing
     It sounds like there isn’t a solution
     – Neither fashionable solution handles everything
  4. This is not a problem. It’s an opportunity!
  5. Hadoop is Not Very Real-time
     [Timeline diagram over t: fully processed data vs. an unprocessed latest period – the latest full period stays unprocessed for as long as the Hadoop job takes]
  6. Real-time and Long-time together
     [Timeline diagram over t: a blended view – Hadoop works great back here, Storm works here, near now]
  7. Rough Design – Data Flow
     [Topology diagram: search engine → query-event spouts → counter bolts → logger bolts; the logger bolts write raw logs and semi-aggregated snapshots, which a Hadoop aggregator rolls up into long-term aggregates. A sketch of the semi-aggregation step follows.]
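     To make the "semi-agg" step concrete, here is a minimal sketch in R (the same language as the code slide later in the deck) of turning raw query events into per-second, per-query partial counts of the kind the counter/logger bolts would emit and Hadoop would later roll up. The data frame, column names, and one-second window are hypothetical, not taken from the talk.

        # Hypothetical raw query events: a timestamp (seconds) and a query label
        events <- data.frame(t = c(1.2, 1.7, 1.9, 2.3, 2.8),
                             query = c("q1", "q1", "q2", "q1", "q2"))

        # Semi-aggregate into one partial count per (second, query) bucket
        events$window <- floor(events$t)
        semi_agg <- aggregate(count ~ window + query,
                              data = transform(events, count = 1), FUN = sum)
        print(semi_agg)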
  8. Guarantees
     Counter output volume is small-ish
     – the greater of k tuples per 100K inputs or k tuples/s
     – 1 tuple/s/label/bolt for this exercise
     Persistence layer must provide guarantees
     – distributed against node failure
     – must have either readable flush or closed-append
     HDFS is distributed, but provides no guarantees and strange semantics
     MapRfs is distributed, provides all necessary guarantees
  9. Presentation Layer
     Presentation must
     – read recent output of Logger bolt
     – read relevant output of Hadoop jobs
     – combine semi-aggregated records (see the sketch below)
     User will see
     – counts that increment within 0–2 s of events
     – seamless meld of short and long-term data
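     As an illustration of that blend, here is a minimal R sketch of combining semi-aggregated counts from the Hadoop side with the most recent counts from the Logger bolt at presentation time. The data frames long_agg and recent_agg and their contents are hypothetical placeholders for the two outputs.

        # Hypothetical semi-aggregated outputs from the two paths
        long_agg   <- data.frame(label = c("q1", "q2"), count = c(100000, 52000))  # Hadoop: older data
        recent_agg <- data.frame(label = c("q1", "q3"), count = c(120, 45))        # Logger bolt: last seconds

        # Blend the two views: stack the rows and sum counts per label
        blended <- aggregate(count ~ label, data = rbind(long_agg, recent_agg), FUN = sum)
        print(blended)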
  10. Example 2 – AB testing in real-time
      I have 15 versions of my landing page
      Each visitor is assigned to a version
      – Which version?
      A conversion or sale or whatever can happen
      – How long to wait?
      Some versions of the landing page are horrible
      – Don’t want to give them traffic
  11. A Quick Diversion
      You see a coin
      – What is the probability of heads?
      – Could it be larger or smaller than that?
      I flip the coin and while it is in the air ask again
      I catch the coin and ask again
      I look at the coin (and you don’t) and ask again
      Why does the answer change?
      – And did it ever have a single value?
  12. A Philosophical Conclusion
      Probability as expressed by humans is subjective and depends on information and experience
  13. I Dunno
  14. 5 heads out of 10 throws
  15. 2 heads out of 12 throws (the corresponding distributions are sketched below)
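      Slides 13–15 show how the belief about the heads probability changes as flips accumulate. A minimal R sketch of the corresponding beta distributions, assuming the uniform-prior Beta(heads+1, tails+1) form that the rbeta call on the code slide below uses:

        # Belief about p(heads) encoded as beta distributions
        # "I Dunno": uniform prior, Beta(1, 1)
        curve(dbeta(x, 1, 1), from = 0, to = 1, ylim = c(0, 4),
              xlab = "p(heads)", ylab = "density")
        # After 5 heads out of 10 throws: Beta(5 + 1, 5 + 1)
        curve(dbeta(x, 6, 6), add = TRUE, lty = 2)
        # After 2 heads out of 12 throws: Beta(2 + 1, 10 + 1)
        curve(dbeta(x, 3, 11), add = TRUE, lty = 3)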
  16. Bayesian Bandit
      Compute distributions based on data
      Sample p1 and p2 from these distributions
      Put a coin in bandit 1 if p1 > p2
      Else, put the coin in bandit 2
      (A two-bandit sketch of this step follows.)
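      A minimal R sketch of that selection step for two bandits, assuming we have already counted wins and losses for each; the wins/losses values here are made up for illustration.

        # Hypothetical win/loss counts observed so far for two bandits
        wins   <- c(4, 1)
        losses <- c(6, 2)

        # Sample a plausible payoff probability for each bandit from its beta posterior
        p <- rbeta(2, wins + 1, losses + 1)

        # Play the bandit whose sampled probability is larger
        choice <- if (p[1] > p[2]) 1 else 2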
  17. And it works!
      [Plot: regret vs. n (0–1100 plays), comparing ε-greedy with ε = 0.05 against the Bayesian Bandit with a Gamma-Normal model]
  18. Video Demo
  19. The Code
      Select an alternative:

        # Thompson sampling over the count matrix k, where k[i,1] counts
        # failures and k[i,2] counts successes for alternative i
        select = function(k) {
          n = dim(k)[1]
          p0 = rep(0, length.out=n)
          for (i in 1:n) {
            p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
          }
          return (which(p0 == max(p0)))
        }

      Select and learn:

        # test(i) plays alternative i and returns the outcome column (1 or 2)
        learn = function(k, steps) {
          for (z in 1:steps) {
            i = select(k)
            j = test(i)
            k[i,j] = k[i,j]+1
          }
          return (k)
        }

      But we already know how to count!
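      A hedged usage sketch of the two functions above, assuming a two-alternative setup and a hypothetical test() that returns column 2 on success and column 1 on failure; the conversion rates are invented for illustration.

        # Hypothetical environment: arm 1 converts 5% of the time, arm 2 converts 10%
        rates <- c(0.05, 0.10)
        test <- function(i) if (runif(1) < rates[i]) 2 else 1

        k <- matrix(0, nrow = 2, ncol = 2)   # start with no observations
        k <- learn(k, 1000)                  # traffic drifts toward the better arm
        k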
  20. The Basic Idea
      We can encode a distribution by sampling
      Sampling allows unification of exploration and exploitation
      Can be extended to more general response models
  21. Contact:
      – tdunning@maprtech.com
      – @ted_dunning
      Slides and such (available late tonight):
      – http://info.mapr.com/ted-bbuzz-2012
      Hash tags: #mapr #bbuzz
      Collective notes: http://bit.ly/JDCRhc
  22. Thank You
