Storm 2012 03-29

Detailed design for a robust counter as well as design for a completely on-line multi-armed bandit implementation that uses the new Bayesian Bandit algorithm - by Ted Dunning

Storm 2012 03-29 Presentation Transcript

  • Real-time and long-time Fun with Hadoop + Storm
  • The Challenge
    • Hadoop is great at processing vats of data
      – But sucks for real-time (by design!)
    • Storm is great for real-time processing
      – But lacks any way to deal with batch processing
    • It sounds like there isn't a solution
      – Neither fashionable solution handles everything
  • This is not a problem. It's an opportunity!
  • Hadoop is Not Very Real-time (timeline figure: as of now, the latest full period of data is still unprocessed, because the Hadoop job takes that long to run over it)
  • Need to Plug the Hole in Hadoop
    • We have real-time data with limited state
      – Exactly what Storm does
      – And what Hadoop does not
    • Can Storm and Hadoop be combined?
  • Real-time and Long-time Together (timeline figure: Hadoop works great on older data, Storm works on recent data, and the two are presented as a single blended view)
  • An Example
    • I want to know how many queries I get
      – Per second, minute, day, week
    • Results should be available
      – within <2 seconds 99.9+% of the time
      – within 30 seconds almost always
    • History should last >3 years
    • Should work for 0.001 q/s up to 100,000 q/s
    • Failure tolerant, yadda, yadda
  • Rough Design – Data Flow (diagram; components: Search Engine, Query Event Spout, Counter Bolt, Logger Bolts, Raw Logs, semi-aggregated counts, Hadoop Aggregator, snapshots, long-term aggregates)
  • Counter Bolt Detail (see the sketch below)
    • Input: labels to count
    • Output: short-term semi-aggregated counts
      – (time-window, label, count)
    • Non-zero counts emitted if
      – event count reaches threshold (typically 100K)
      – time since last count reaches threshold (typically 1 s)
    • Tuples acked when counts emitted
    • Double-count probability is > 0 but very small
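
A minimal sketch of a counter bolt along these lines is shown below, written against the Storm 0.x (backtype.storm) Java API. It only illustrates the flush-on-threshold, ack-on-emit behaviour described on the slide; it is not the implementation from the deck, and names such as CounterBolt, COUNT_THRESHOLD, and FLUSH_INTERVAL_MS are assumptions.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Sketch only: semi-aggregates (label -> count) in memory and flushes when
    // either the event-count threshold or the time threshold is reached.
    public class CounterBolt extends BaseRichBolt {
        private static final long COUNT_THRESHOLD = 100_000;  // ~100K events
        private static final long FLUSH_INTERVAL_MS = 1_000;  // ~1 s

        private OutputCollector collector;
        private Map<String, Long> counts;   // label -> count for the current window
        private List<Tuple> pending;        // input tuples not yet acked
        private long eventsSinceFlush;
        private long lastFlush;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.counts = new HashMap<>();
            this.pending = new ArrayList<>();
            this.lastFlush = System.currentTimeMillis();
        }

        @Override
        public void execute(Tuple tuple) {
            counts.merge(tuple.getStringByField("label"), 1L, Long::sum);
            pending.add(tuple);
            eventsSinceFlush++;
            long now = System.currentTimeMillis();
            if (eventsSinceFlush >= COUNT_THRESHOLD || now - lastFlush >= FLUSH_INTERVAL_MS) {
                flush(now);
            }
        }

        // Emit semi-aggregated (time-window, label, count) tuples, then ack the
        // inputs that contributed to them.
        private void flush(long now) {
            long window = now / FLUSH_INTERVAL_MS;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                collector.emit(new Values(window, e.getKey(), e.getValue()));
            }
            for (Tuple t : pending) {
                collector.ack(t);
            }
            counts.clear();
            pending.clear();
            eventsSinceFlush = 0;
            lastFlush = now;
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("window", "label", "count"));
        }
    }

As the later slides point out, acking right after emitting is only safe if the emitted counts are persisted quickly (downstream or in a local transaction log); a failure between persistence and acking can at worst cause a small amount of double counting.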
  • Counter Bolt Counterintuitivity
    • Counts are emitted for the same label and same time window many times
      – these are semi-aggregated
      – this is a feature
      – tuples can be acked within 1 s
      – time windows can be much longer than 1 s
    • No need to send the same label to the same bolt
      – speeds failure recovery
  • Design Flexibility
    • Counter can persist a short-term transaction log
      – counter can recover state on failure
      – log is normally burn-after-write
    • Count flush interval can be extended without extending tuple timeout
      – Decreases currency of counts
      – System is still real-time at a longer time-scale
    • Total bandwidth for the log is typically not huge
  • Counter Bolt No-nos
    • Cannot accumulate an entire period in memory
      – Tuples must be acked much sooner
      – State must be persisted before acking
      – State can easily grow too large to handle without disk access
    • Cannot persist the entire count table at once
      – Incremental persistence required (see the sketch below)
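
One way to read "incremental persistence" is an append-only transaction log: each flush appends only the counts just emitted, rather than rewriting the whole table. The following is a deliberately simple sketch of that idea; the CountLog class, file handling, and record format are assumptions, and a production version would need a durable sync (e.g. an HDFS/MapR hflush) rather than a plain buffer flush.

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;

    // Append-only log of semi-aggregated counts; one record per (window, label, count).
    public class CountLog {
        private final BufferedWriter log;

        public CountLog(String path) throws IOException {
            this.log = new BufferedWriter(new FileWriter(path, true));  // append mode
        }

        // Append one flush worth of counts and push it out before the bolt acks
        // the tuples that produced them.
        public void append(long window, Map<String, Long> counts) throws IOException {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                log.write(window + "\t" + e.getKey() + "\t" + e.getValue());
                log.newLine();
            }
            log.flush();
        }
    }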
  • Guarantees
    • Counter output volume is small-ish
      – the greater of k tuples per 100K inputs or k tuple/s
      – 1 tuple/s/label/bolt for this exercise
    • Persistence layer must provide guarantees
      – distributed against node failure
      – must have either readable flush or closed-append
      – HDFS is distributed, but provides neither guarantee
      – MapRfs is distributed and provides both guarantees
  • Failure Modes
    • Bolt failure
      – buffered tuples will go unacked
      – after timeout, tuples will be resent
      – timeout ≈ 10 s
      – if failure occurs after persistence but before acking, double-counting is possible
    • Storage (with MapR)
      – most failures invisible
      – a few continue within 0-2 s, some take 10 s
      – catastrophic cluster restart can take 2-3 min
      – logger can buffer this much easily
  • Presentation Layer
    • Presentation must
      – read recent output of the Logger bolt
      – read relevant output of the Hadoop jobs
      – combine semi-aggregated records (see the sketch below)
    • User will see
      – counts that increment within 0-2 s of events
      – a seamless meld of short- and long-term data
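
Because the records are additive semi-aggregates, the presentation-layer merge can be a plain sum per (window, label), regardless of whether a record came from the real-time path or from Hadoop. A small illustrative sketch follows; the Record and Merger names are assumptions, not from the deck.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Merger {
        public static class Record {
            final long window;
            final String label;
            final long count;

            Record(long window, String label, long count) {
                this.window = window;
                this.label = label;
                this.count = count;
            }
        }

        // Sum counts per (window, label) across the long-term (Hadoop) and
        // recent (Logger bolt) outputs.
        public static Map<String, Long> merge(List<Record> longTerm, List<Record> recent) {
            Map<String, Long> totals = new HashMap<>();
            for (List<Record> source : Arrays.asList(longTerm, recent)) {
                for (Record r : source) {
                    totals.merge(r.window + "/" + r.label, r.count, Long::sum);
                }
            }
            return totals;
        }
    }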
  • Mobile Network Monitor (architecture diagram; components: geo-dispersed ingest servers, transaction data, batch aggregation, map, real-time dashboard and alerts, retro-analysis interface)
  • Example 2 – Real-time learning
    • My system has to
      – learn a response model and
      – select training data
      – in real-time
    • Data rate up to 100K queries per second
  • Door Number 3
    • I have 15 versions of my landing page
    • Each visitor is assigned to a version
      – Which version?
    • A conversion or sale or whatever can happen
      – How long to wait?
    • Some versions of the landing page are horrible
      – Don't want to give them traffic
  • Real-time Constraints
    • Selection must happen in <20 ms almost all the time
    • Training events must be handled in <20 ms
    • Failover must happen within 5 seconds
    • Client should timeout and back off
      – no need for an answer after 500 ms
    • State persistence required
  • Rough Design (diagram; components: DRPC Spout, Query Event Spout, Logger Bolt, Counter Bolt, Raw Logs, Model State, Timed Join, Model, Conversion Detector, Selector Layer)
  • A Quick Diversion
    • You see a coin
      – What is the probability of heads?
      – Could it be larger or smaller than that?
    • I flip the coin and, while it is in the air, ask again
    • I catch the coin and ask again
    • I look at the coin (and you don't) and ask again
    • Why does the answer change?
      – And did it ever have a single value?
  • A First Conclusion
    • Probability as expressed by humans is subjective and depends on information and experience
  • A Second Conclusion
    • A single number is a bad way to express uncertain knowledge
    • A distribution of values might be better
  • I Dunno
  • 5 and 5
  • 2 and 10
  • Bayesian Bandit (see the sketch below)
    • Compute distributions based on data
    • Sample p1 and p2 from these distributions
    • Put a coin in bandit 1 if p1 > p2
    • Else, put the coin in bandit 2
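
A minimal sketch of that loop for two bandits, using Thompson sampling with Beta posteriors over conversion rates (a common choice for yes/no outcomes; the deck's regret plot uses a Gamma-Normal model instead). It assumes Apache Commons Math for Beta sampling, and the BayesianBandit class name is illustrative.

    import org.apache.commons.math3.distribution.BetaDistribution;

    public class BayesianBandit {
        // Beta(1 + wins, 1 + losses) posterior per bandit, starting from a flat prior.
        private final double[] wins = new double[2];
        private final double[] losses = new double[2];

        // Sample a plausible conversion rate for each bandit and play the larger one.
        public int choose() {
            double p1 = new BetaDistribution(1 + wins[0], 1 + losses[0]).sample();
            double p2 = new BetaDistribution(1 + wins[1], 1 + losses[1]).sample();
            return p1 > p2 ? 0 : 1;
        }

        // Sharpen the chosen bandit's distribution with the observed outcome.
        public void update(int bandit, boolean converted) {
            if (converted) {
                wins[bandit]++;
            } else {
                losses[bandit]++;
            }
        }
    }

Early on the posteriors are wide, so both bandits get traffic (exploration); as evidence accumulates the samples concentrate and the better bandit wins almost every draw (exploitation).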
  • And it works! (plot of regret versus n, comparing ε-greedy with ε = 0.05 against the Bayesian Bandit with Gamma-Normal)
  • The Basic Idea
    • We can encode a distribution by sampling
    • Sampling allows unification of exploration and exploitation
    • Can be extended to more general response models
  • Contact:
    – tdunning@maprtech.com
    – @ted_dunning
  • Slides and such:
    – http://info.mapr.com/ted-storm-2012-03