Real-Enough-Time Data Processing - CHUG - 2012024
View the accompanying video on vimeo: https://vimeo.com/45547604

Published in: Technology, Business
Transcript

  • 1. Real-time and long-time: Fun with Hadoop + Storm
  • 2. The Challenge
      • Hadoop is great at processing vats of data
          – But sucks for real-time (by design!)
      • Storm is great for real-time processing
          – But lacks any way to deal with batch processing
      • It sounds like there isn’t a solution
          – Neither fashionable solution handles everything
  • 3. This is not a problem. It’s an opportunity!
  • 4. Hadoop is Not Very Real-time
      [Figure: timeline along t ending at "now". Labels: "Fully processed", "Latest full period", "Unprocessed Data", "Hadoop job takes this long for this data".]
  • 5. Need to Plug the Hole in Hadoop
      • We have real-time data with limited state
          – Exactly what Storm does
          – And what Hadoop does not
      • Can Storm and Hadoop be combined?
  • 6. Real-time and Long-time together
      [Figure: timeline along t ending at "now". Labels: "Blended view", "Hadoop works great back here", "Storm works here".]
  • 7. An Example
      • I want to know how many queries I get
          – Per second, minute, day, week
      • Results should be available
          – within <2 seconds 99.9+% of the time
          – within 30 seconds almost always
      • History should last >3 years
      • Should work for 0.001 q/s up to 100,000 q/s
      • Failure tolerant, yadda, yadda
  • 8. Rough Design – Data Flow
      [Figure: data-flow diagram. Components named: Search Engine, Query Event Spout, Counter Bolt, Logger Bolt, Semi Agg, Snap, Raw Logs, Hadoop Aggregator, Long agg.]
  • 9. Counter Bolt Detail
      • Input: Labels to count
      • Output: Short-term semi-aggregated counts
          – (time-window, label, count)
      • Non-zero counts emitted if
          – event count reaches threshold (typical 100K)
          – time since last count reaches threshold (typical 1s)
      • Tuples acked when counts emitted
      • Double count probability is > 0 but very small
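
The emit rule on slide 9 can be sketched in a few lines of Python. This is a minimal illustration, not Storm code and not the speaker's implementation: the class name `CounterSketch`, its parameters, and the explicit `now` argument are all assumptions made so the example runs on its own. A real bolt would also need a timer tick to flush when no new events arrive, which this sketch leaves out.

```python
from collections import Counter


class CounterSketch:
    """Hypothetical stand-in for the counter bolt (not the Storm API)."""

    def __init__(self, window_s=60, max_events=100_000, max_age_s=1.0):
        self.window_s = window_s      # length of a time window, in seconds
        self.max_events = max_events  # flush after this many buffered events
        self.max_age_s = max_age_s    # flush after this long since last flush
        self.counts = Counter()       # (window_start, label) -> count
        self.pending = []             # tuple ids waiting to be acked
        self.last_flush = None        # set when the first event is seen

    def add(self, tuple_id, label, event_time, now):
        """Count one event; return (records, acked_ids), empty if no flush."""
        if self.last_flush is None:
            self.last_flush = now
        window = int(event_time // self.window_s) * self.window_s
        self.counts[(window, label)] += 1
        self.pending.append(tuple_id)
        too_many = len(self.pending) >= self.max_events
        too_old = now - self.last_flush >= self.max_age_s
        if too_many or too_old:
            return self.flush(now)
        return [], []

    def flush(self, now):
        """Emit semi-aggregated counts and release the buffered tuple ids."""
        records = [(w, lbl, n) for (w, lbl), n in sorted(self.counts.items())]
        acked = self.pending
        self.counts = Counter()
        self.pending = []
        self.last_flush = now
        return records, acked


# Three events in the same 60 s window, with the event threshold set to 3.
counter = CounterSketch(window_s=60, max_events=3, max_age_s=1.0)
results = [counter.add(i, "q", event_time=120.0 + i, now=0.1 * i)
           for i in range(3)]
print(results[-1])
```

Because only non-zero counts are ever stored, a flush never emits empty records, and the same (time-window, label) pair can legitimately appear in several flushes.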
  • 10. Counter Bolt Counterintuitivity
      • Counts are emitted for same label, same time window many times
          – these are semi-aggregated
          – this is a feature
          – tuples can be acked within 1s
          – time windows can be much longer than 1s
      • No need to send same label to same bolt
          – speeds failure recovery
  • 11. Design Flexibility
      • Counter can persist short-term transaction log
          – counter can recover state on failure
          – log is normally burn-after-write
      • Count flush interval can be extended without extending tuple timeout
          – Decreases currency of persisted counts
          – System is still real-time at a longer time-scale
      • Total bandwidth for log is typically not huge
  • 12. Counter Bolt No-nos
      • Cannot accumulate entire period in-memory
          – Tuples must be ack’ed much sooner
          – State must be persisted before ack’ing
          – State can easily grow too large to handle without disk access
      • Cannot persist entire count table at once
          – Incremental persistence required
  • 13. Guarantees
      • Counter output volume is small-ish
          – the greater of k tuples per 100K inputs or k tuples/s
          – 1 tuple/s/label/bolt for this exercise
      • Persistence layer must provide guarantees
          – distributed against node failure
          – must have either readable flush or closed-append
          – HDFS is distributed, but no guarantees
          – MapRfs is distributed, provides both guarantees
  • 14. Failure Modes
      • Bolt failure
          – buffered tuples will go un-acked
          – after timeout, tuples will be resent
          – timeout ≈ 10s
          – if failure occurs after persistence, before acking, then double-counting is possible
      • Storage (with MapR)
          – most failures invisible
          – a few continue within 0-2s, some take 10s
          – catastrophic cluster restart can take 2-3 min
          – logger can buffer this much easily
  • 15. Presentation Layer
      • Presentation must
          – read recent output of Logger bolt
          – read relevant output of Hadoop jobs
          – combine semi-aggregated records
      • User will see
          – counts that increment within 0-2 s of events
          – seamless meld of short and long-term data
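
The "combine semi-aggregated records" step on slide 15 might look roughly like the sketch below. It is an illustration under assumptions the slides do not state: records are (time-window, label, count) tuples as on slide 9, Hadoop output is trusted for every window that starts before a `batch_cutoff`, and recent logger output is used from the cutoff onward. The function name and the cutoff parameter are hypothetical.

```python
from collections import Counter


def blended_counts(batch_records, recent_records, batch_cutoff):
    """Merge long-term and short-term counts into one view.

    Assumption: windows starting before batch_cutoff are fully covered by
    the batch output, so recent records for those windows are skipped to
    avoid counting them twice.
    """
    totals = Counter()
    for window, label, count in batch_records:
        if window < batch_cutoff:
            totals[(window, label)] += count
    for window, label, count in recent_records:
        if window >= batch_cutoff:
            # The same (window, label) may appear many times; just add.
            totals[(window, label)] += count
    return totals


batch = [(0, "q", 500), (60, "q", 480)]
recent = [(60, "q", 10), (120, "q", 3), (120, "q", 4), (180, "q", 1)]
view = blended_counts(batch, recent, batch_cutoff=120)
for key in sorted(view):
    print(key, view[key])
```

Since the counter emits plain partial sums, merging is only addition, which is what lets the same label be counted on any bolt.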
  • 16. Mobile Network Monitor
      [Figure: architecture diagram. Labels: "Transaction data", "Geo-dispersed ingest servers", "Batch aggregation", "Retro-analysis interface", "Map", "Real-time dashboard and alerts".]
  • 17. Example 2 – Real-time learning
      • My system has to
          – learn a response model and
          – select training data
          – in real-time
      • Data rate up to 100K queries per second
  • 18. Door Number 3
      • I have 15 versions of my landing page
      • Each visitor is assigned to a version
          – Which version?
      • A conversion or sale or whatever can happen
          – How long to wait?
      • Some versions of the landing page are horrible
          – Don’t want to give them traffic
  • 19. Real-time Constraints
      • Selection must happen in <1 ms almost all the time
      • Training events must be handled in <20 ms
      • Failover must happen within 5 seconds
      • Client should time out and back off
          – no need for an answer after 500 ms
      • State persistence required
  • 20. Rough Design
      [Figure: data-flow diagram. Components named: Selector Layer, Query Event Spout, Counter Bolt, DRPC Spout, Timed Join, Model, Conversion Detector, Logger Bolt, Model State, Raw Logs.]
  • 21. A Quick Diversion
      • You see a coin
          – What is the probability of heads?
          – Could it be larger or smaller than that?
      • I flip the coin and while it is in the air ask again
      • I catch the coin and ask again
      • I look at the coin (and you don’t) and ask again
      • Why does the answer change?
          – And did it ever have a single value?
  • 22. A First Conclusion
      • Probability as expressed by humans is subjective and depends on information and experience
  • 23. A Second Conclusion
      • A single number is a bad way to express uncertain knowledge
      • A distribution of values might be better
  • 24. I Dunno
  • 25. 5 and 5
  • 26. 2 and 10
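
The images for slides 24 to 26 are not part of the transcript, so what they plotted is not known. One plausible reading, offered only as an assumption, is that they show a distribution over the coin's probability of heads $p$ after no data, after 5 heads and 5 tails, and after 2 heads and 10 tails. Under that reading, and assuming a uniform prior (also not stated in the source), the standard Beta-Bernoulli result gives:

```latex
\begin{align*}
p \mid (h \text{ heads},\ t \text{ tails}) &\sim \mathrm{Beta}(h+1,\ t+1),
  & \mathbb{E}[p] &= \frac{h+1}{h+t+2},
  & \mathrm{Var}[p] &= \frac{(h+1)(t+1)}{(h+t+2)^2\,(h+t+3)} \\[4pt]
h=5,\ t=5: \quad & \mathrm{Beta}(6,\ 6),
  & \mathbb{E}[p] &= \tfrac{6}{12} = 0.5,
  & \sqrt{\mathrm{Var}[p]} &\approx 0.139 \\
h=2,\ t=10: \quad & \mathrm{Beta}(3,\ 11),
  & \mathbb{E}[p] &= \tfrac{3}{14} \approx 0.214,
  & \sqrt{\mathrm{Var}[p]} &\approx 0.106
\end{align*}
```

With no data at all the same formula gives $\mathrm{Beta}(1, 1)$, the flat "I dunno" distribution on $[0, 1]$.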
  • 27. Bayesian Bandit
      • Compute distributions based on data
      • Sample p1 and p2 from these distributions
      • Put a coin in bandit 1 if p1 > p2
      • Else, put the coin in bandit 2
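
The four steps on slide 27 can be sketched as follows. This is a minimal illustration, not the speaker's code: it assumes binary rewards (convert or not) and a Beta posterior with a uniform prior for each arm, whereas the plot on slide 28 refers to a Gamma-Normal model. The names `BetaBandit` and `simulate`, and the conversion rates in the example, are made up for the demonstration.

```python
import random


class BetaBandit:
    """Bayesian bandit with one Beta posterior per arm (assumed model)."""

    def __init__(self, n_arms, rng=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = rng or random.Random()

    def choose(self):
        # Sample one plausible conversion rate per arm, then play the
        # arm whose sample is largest.
        samples = [self.rng.betavariate(w + 1, l + 1)
                   for w, l in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Fold the observed outcome back into that arm's distribution.
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against simulated arms; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, rng.random() < true_rates[arm])
    return pulls


pulls = simulate([0.2, 0.6], rounds=2000)
print(pulls)  # the second arm should receive most of the pulls
```

The same `choose` works unchanged for more than two arms, such as the 15 landing-page versions on slide 18, because it simply takes the largest of the sampled values.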
  • 28. And it works!
      [Figure: plot of regret (y-axis) against n (x-axis). Series: "ε-greedy, ε = 0.05" and "Bayesian Bandit with Gamma-Normal".]
  • 29. The Basic Idea
      • We can encode a distribution by sampling
      • After sampling, all code can be deterministic
      • Sampling allows unification of exploration and exploitation
      • Can be extended to more general response models
  • 30. Contact and slides
      • Contact:
          – tdunning@maprtech.com
          – @ted_dunning
      • Slides and such:
          – http://info.mapr.com/ted-chicago-hug-4-23-12.html
