Buzz words 2012 - Real-time learning

This is the talk that I gave at Buzzwords. It is a heavily condensed version of the real-time learning talk I have given previously. The associated video is available at http://www.youtube.com/watch?v=1nQBnyaI6s0

Published in: Technology, Business

Buzz words 2012 - Real-time learning

  1. Real-time and Long-time with Storm and Hadoop (© MapR Technologies - Confidential)
  2. Real-time and Long-time with Storm and Hadoop (MapR)
  3. The Challenge
     Hadoop is great at processing vats of data
       – But sucks for real-time (by design!)
     Storm is great for real-time processing
       – But lacks any way to deal with batch processing
     It sounds like there isn’t a solution
       – Neither fashionable solution handles everything
  4. This is not a problem. It’s an opportunity!
  5. Hadoop is Not Very Real-time
     [Timeline diagram along t, ending at "now". Regions: fully processed data, latest full period, unprocessed data. Annotation: the Hadoop job takes this long for this data.]
  6. Real-time and Long-time together
     [Timeline diagram along t, ending at "now", labelled "Blended view". Annotations: Hadoop works great back here; Storm works here.]
  7. Rough Design – Data Flow
     [Data-flow diagram. Components shown: Search Engine, Query Event Spout, Counter Bolt, Logger Bolt, Raw Logs, Semi Agg, Snap, Hadoop Aggregator, Long agg.]
  8. Guarantees
     Counter output volume is small-ish
       – the greater of k tuples per 100K inputs or k tuples/s
       – 1 tuple/s/label/bolt for this exercise
     Persistence layer must provide guarantees
       – distributed against node failure
       – must have either readable flush or closed-append
     HDFS is distributed, but provides no guarantees and has strange semantics
     MapRfs is distributed and provides all necessary guarantees
  9. Presentation Layer
     Presentation must
       – read recent output of Logger bolt
       – read relevant output of Hadoop jobs
       – combine semi-aggregated records
     User will see
       – counts that increment within 0-2 s of events
       – seamless meld of short and long-term data
  10. Example 2 – AB testing in real-time
      I have 15 versions of my landing page
      Each visitor is assigned to a version
        – Which version?
      A conversion or sale or whatever can happen
        – How long to wait?
      Some versions of the landing page are horrible
        – Don’t want to give them traffic
  11. A Quick Diversion
      You see a coin
        – What is the probability of heads?
        – Could it be larger or smaller than that?
      I flip the coin and while it is in the air ask again
      I catch the coin and ask again
      I look at the coin (and you don’t) and ask again
      Why does the answer change?
        – And did it ever have a single value?
  12. A Philosophical Conclusion
      Probability as expressed by humans is subjective and depends on information and experience
  13. I Dunno
  14. 5 heads out of 10 throws
  15. 2 heads out of 12 throws
  16. Bayesian Bandit
      Compute distributions based on data
      Sample p1 and p2 from these distributions
      Put a coin in bandit 1 if p1 > p2
      Else, put the coin in bandit 2
  17. And it works!
      [Plot: regret (0 to 0.12) against n (0 to 1100), comparing ε-greedy with ε = 0.05 against the Bayesian Bandit with Gamma-Normal.]
  18. Video Demo
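  19. The Code
      Select an alternative
          select = function(k) {
            n = dim(k)[1]                  # number of alternatives (rows of k)
            p0 = rep(0, length.out=n)
            for (i in 1:n) {
              # one draw from the Beta posterior of alternative i; column 2 feeds
              # the first shape parameter, so it presumably counts successes
              p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
            }
            return (which(p0 == max(p0)))
          }
      Select and learn
          # the slide shows only the loop body; the wrapper name "learn" is hypothetical
          learn = function(k, steps) {
            for (z in 1:steps) {
              i = select(k)
              j = test(i)                  # not shown on the slide; must return a column of k
              k[i,j] = k[i,j]+1
            }
            return (k)
          }
      But we already know how to count!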
  20. The Basic Idea
      We can encode a distribution by sampling
      Sampling allows unification of exploration and exploitation
      Can be extended to more general response models
  21. Contact:
        – tdunning@maprtech.com
        – @ted_dunning
      Slides and such (available late tonight):
        – http://info.mapr.com/ted-bbuzz-2012
      Hash tags: #mapr #bbuzz
      Collective notes: http://bit.ly/JDCRhc
  22. Thank You
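
Slide 9 asks the presentation layer to combine semi-aggregated records from the real-time path with the output of the Hadoop jobs. A minimal sketch of that blending step in R might look like the following. The labels, the counts, and the function name blend_counts are all made up for illustration, and the sketch assumes the recent records cover only events the batch job has not yet counted.

```r
# Hypothetical long-term totals per label, as a batch (Hadoop) job might emit them.
long_agg <- c(apple = 1200, banana = 450)

# Hypothetical semi-aggregated records from the real-time path:
# several partial counts per label, none of them seen by the batch job yet.
recent <- data.frame(
  label = c("apple", "banana", "apple", "cherry"),
  count = c(3, 1, 2, 5),
  stringsAsFactors = FALSE
)

blend_counts <- function(long_agg, recent) {
  # Collapse the partial counts to one total per label.
  recent_totals <- vapply(split(recent$count, recent$label), sum, numeric(1))

  # Start every known label at zero, then add both sources.
  labels <- union(names(long_agg), names(recent_totals))
  blended <- setNames(numeric(length(labels)), labels)
  blended[names(long_agg)] <- long_agg
  blended[names(recent_totals)] <- blended[names(recent_totals)] + recent_totals
  blended
}

blended <- blend_counts(long_agg, recent)
print(blended)
```

A label that appears only in the recent records (cherry here) still shows up in the blended view, which is what lets counts increment before any batch job has run.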
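
Slides 13 to 15 step through what we believe about a coin with no data, after 5 heads out of 10 throws, and after 2 heads out of 12 throws. One way to make that concrete is a Beta posterior. The sketch below assumes a uniform Beta(1, 1) prior, which matches the +1 terms in the rbeta call on slide 19; the function name posterior and the choice of a 95% interval are mine.

```r
# Posterior over P(heads) after some throws, assuming a uniform Beta(1, 1) prior.
posterior <- function(heads, throws) {
  a <- heads + 1             # first shape parameter: heads seen, plus the prior
  b <- (throws - heads) + 1  # second shape parameter: tails seen, plus the prior
  list(
    alpha = a,
    beta = b,
    mean = a / (a + b),
    interval = qbeta(c(0.025, 0.975), a, b)  # central 95% credible interval
  )
}

no_data       <- posterior(0, 0)    # "I Dunno"
five_of_ten   <- posterior(5, 10)
two_of_twelve <- posterior(2, 12)

for (p in list(no_data, five_of_ten, two_of_twelve)) {
  cat(sprintf("Beta(%d, %d): mean %.3f, 95%% interval [%.3f, %.3f]\n",
              p$alpha, p$beta, p$mean, p$interval[1], p$interval[2]))
}
```

With no data the interval spans almost the whole range from 0 to 1. Each batch of throws narrows it, and the 2-of-12 coin ends up with most of its mass well below one half.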
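
The Bayesian Bandit rule on slide 16 and the loop on slide 19 can be run end to end with a simulated pair of bandits. This is only a sketch: the payoff probabilities in true_p, the step count, the seed, and the pull function (standing in for the test function that the slide does not show) are all assumptions.

```r
set.seed(42)  # fixed seed so the run is repeatable

# Hypothetical true payoff probabilities, unknown to the algorithm.
true_p <- c(0.3, 0.6)

# k[i, 1] counts failures and k[i, 2] counts successes for bandit i.
k <- matrix(0, nrow = 2, ncol = 2)

select <- function(k) {
  # Sample one plausible payoff rate per bandit from its Beta posterior,
  # then play whichever bandit drew the larger sample.
  p <- rbeta(nrow(k), k[, 2] + 1, k[, 1] + 1)
  which.max(p)
}

pull <- function(i) {
  # Simulated outcome: returns 2 for a success and 1 for a failure,
  # so the result can index a column of k directly.
  rbinom(1, 1, true_p[i]) + 1
}

steps <- 2000
for (z in 1:steps) {
  i <- select(k)
  j <- pull(i)
  k[i, j] <- k[i, j] + 1
}

pulls <- rowSums(k)  # how often each bandit was played
print(k)
print(pulls)
```

Early on both bandits get played because their posteriors overlap. As counts accumulate, the weaker bandit rarely draws the larger sample, so traffic drifts to the better one without a separate exploration setting.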