Real-time and long-time
Fun with Hadoop + Storm
The Challenge
• Hadoop is great of processing vats of data
– But sucks for real-time (by design!)
• Storm is great for rea...
This is not a problem.
It’s an opportunity!
t
now
Hadoop is Not Very Real-time
Unprocessed
Data
Fully
processed
Latest full
period
Hadoop job
takes this
long for this...
Need to Plug the Hole in Hadoop
• We have real-time data with limited state
– Exactly what Storm does
– And what Hadoop do...
t
now
Hadoop works
great back here
Storm
works
here
Real-time and Long-time together
Blended
view
Blended
view
Blended
View
An Example
• I want to know how many queries I get
– Per second, minute, day, week
• Results should be available
– within ...
Rough Design – Data Flow
Search
Engine
Query Event
Spout
Logger
Bolt
Counter
Bolt
Raw
Logs
Logger
Bolt
Semi
Agg
Hadoop
Agg...
Counter Bolt Detail
• Input: Labels to count
• Output: Short-term semi-aggregated counts
– (time-window, label, count)
• N...
Counter Bolt Counterintuitivity
• Counts are emitted for same label, same time
window many times
– these are semi-aggregat...
Design Flexibility
• Counter can persist short-term transaction log
– counter can recover state on failure
– log is normal...
Counter Bolt No-nos
• Cannot accumulate entire period in-memory
– Tuples must be ack’ed much sooner
– State must be persis...
Guarantees
• Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuple/s
– 1 tuple/s/label/b...
Failure Modes
• Bolt failure
– buffered tuples will go un’acked
– after timeout, tuples will be resent
– timeout ≈ 10s
– i...
Presentation Layer
• Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine...
Mobile Network Monitor
16
Transaction
data
Batch aggregation
Map
Real-time dashboard
and alerts
Geo-dispersed
ingest serve...
Example 2 – Real-time learning
• My system has to
– learn a response model
and
– select training data
– in real-time
• Dat...
Door Number 3
• I have 15 versions of my landing page
• Each visitor is assigned to a version
– Which version?
• A convers...
Real-time Constraints
• Selection must happen in <20 ms almost all
the time
• Training events must be handled in <20 ms
• ...
Rough Design
DRPC Spout
Query Event
Spout
Logger
Bolt
Counter
Bolt
Raw
Logs
Model
State
Timed Join Model
Logger
Bolt
Conve...
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip...
A First Conclusion
• Probability as expressed by humans is
subjective and depends on information and
experience
A Second Conclusion
• A single number is a bad way to express
uncertain knowledge
• A distribution of values might be bett...
I Dunno
5 and 5
2 and 10
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1...
And it works!
11000 100 200 300 400 500 600 700 800 900 1000
0.12
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
...
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
exploitation
• ...
• Contact:
– tdunning@maprtech.com
– @ted_dunning
• Slides and such:
– http://info.mapr.com/ted-storm-2012-03
Upcoming SlideShare
Loading in...5
×

Storm 2012 03-29

189

Published on

Detailed design for a robust counter as well as design for a completely on-line multi-armed bandit implementation that uses the new Bayesian Bandit algorithm - by Ted Dunning

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
189
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
8
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Storm 2012 03-29

  1. 1. Real-time and long-time Fun with Hadoop + Storm
  2. 2. The Challenge • Hadoop is great of processing vats of data – But sucks for real-time (by design!) • Storm is great for real-time processing – But lacks any way to deal with batch processing • It sounds like there isn’t a solution – Neither fashionable solution handles everything
  3. 3. This is not a problem. It’s an opportunity!
  4. 4. t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  5. 5. Need to Plug the Hole in Hadoop • We have real-time data with limited state – Exactly what Storm does – And what Hadoop does not • Can Storm and Hadoop be combined?
  6. 6. t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  7. 7. An Example • I want to know how many queries I get – Per second, minute, day, week • Results should be available – within <2 seconds 99.9+% of the time – within 30 seconds almost always • History should last >3 years • Should work for 0.001 q/s up to 100,000 q/s • Failure tolerant, yadda, yadda
  8. 8. Rough Design – Data Flow Search Engine Query Event Spout Logger Bolt Counter Bolt Raw Logs Logger Bolt Semi Agg Hadoop Aggregator Snap Long agg Query Event Spout Counter Bolt Logger Bolt
  9. 9. Counter Bolt Detail • Input: Labels to count • Output: Short-term semi-aggregated counts – (time-window, label, count) • Non-zero counts emitted if – event count reaches threshold (typical 100K) – time since last count reaches threshold (typical 1s) • Tuples acked when counts emitted • Double count probability is > 0 but very small
  10. 10. Counter Bolt Counterintuitivity • Counts are emitted for same label, same time window many times – these are semi-aggregated – this is a feature – tuples can be acked within 1s – time windows can be much longer than 1s • No need to send same label to same bolt – speeds failure recovery
  11. 11. Design Flexibility • Counter can persist short-term transaction log – counter can recover state on failure – log is normally burn after write • Count flush interval can be extended without extending tuple timeout – Decreases currency of counts – System is still real-time at a longer time-scale • Total bandwidth for log is typically not huge
  12. 12. Counter Bolt No-nos • Cannot accumulate entire period in-memory – Tuples must be ack’ed much sooner – State must be persisted before ack’ing – State can easily grow too large to handle without disk access • Cannot persist entire count table at once – Incremental persistence required
  13. 13. Guarantees • Counter output volume is small-ish – the greater of k tuples per 100K inputs or k tuple/s – 1 tuple/s/label/bolt for this exercise • Persistence layer must provide guarantees – distributed against node failure – must have either readable flush or closed-append – HDFS is distributed, but no guarantees – MapRfs is distributed, provides both guarantees
  14. 14. Failure Modes • Bolt failure – buffered tuples will go un’acked – after timeout, tuples will be resent – timeout ≈ 10s – if failure occurs after persistence, before acking, then double-counting is possible • Storage (with MapR) – most failures invisible – a few continue within 0-2s, some take 10s – catastrophic cluster restart can take 2-3 min – logger can buffer this much easily
  15. 15. Presentation Layer • Presentation must – read recent output of Logger bolt – read relevant output of Hadoop jobs – combine semi-aggregated records • User will see – counts that increment within 0-2 s of events – seamless meld of short and long-term data
  16. 16. Mobile Network Monitor 16 Transaction data Batch aggregation Map Real-time dashboard and alerts Geo-dispersed ingest servers Retro-analysis interface
  17. 17. Example 2 – Real-time learning • My system has to – learn a response model and – select training data – in real-time • Data rate up to 100K queries per second
  18. 18. Door Number 3 • I have 15 versions of my landing page • Each visitor is assigned to a version – Which version? • A conversion or sale or whatever can happen – How long to wait? • Some versions of the landing page are horrible – Don’t want to give them traffic
  19. 19. Real-time Constraints • Selection must happen in <20 ms almost all the time • Training events must be handled in <20 ms • Failover must happen within 5 seconds • Client should timeout and back-off – no need for an answer after 500ms • State persistence required
  20. 20. Rough Design DRPC Spout Query Event Spout Logger Bolt Counter Bolt Raw Logs Model State Timed Join Model Logger Bolt Conversion Detector Selector Layer
  21. 21. A Quick Diversion • You see a coin – What is the probability of heads? – Could it be larger or smaller than that? • I flip the coin and while it is in the air ask again • I catch the coin and ask again • I look at the coin (and you don’t) and ask again • Why does the answer change? – And did it ever have a single value?
  22. 22. A First Conclusion • Probability as expressed by humans is subjective and depends on information and experience
  23. 23. A Second Conclusion • A single number is a bad way to express uncertain knowledge • A distribution of values might be better
  24. 24. I Dunno
  25. 25. 5 and 5
  26. 26. 2 and 10
  27. 27. Bayesian Bandit • Compute distributions based on data • Sample p1 and p2 from these distributions • Put a coin in bandit 1 if p1 > p2 • Else, put the coin in bandit 2
  28. 28. And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  29. 29. The Basic Idea • We can encode a distribution by sampling • Sampling allows unification of exploration and exploitation • Can be extended to more general response models
  30. 30. • Contact: – tdunning@maprtech.com – @ted_dunning • Slides and such: – http://info.mapr.com/ted-storm-2012-03
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×