Talk given by Ted Dunning at the London Hadoop users' group meeting in May of 2012 about how to do real-time and batch computation on the same stream of information.

  • MapR combines the best of the open source technology with our own deep innovations to provide the most advanced distribution for Apache Hadoop. MapR’s team has a deep bench of enterprise software experience with proven success across storage, networking, virtualization, analytics, and open source technologies. Our CEO has driven multiple companies to successful outcomes in the analytic, storage, and virtualization spaces. Our CTO and co-founder M.C. Srivas was most recently at Google in BigTable. He understands the challenges of MapReduce at huge scale. Srivas was also the chief software architect at Spinnaker Networks, which came out of stealth with the fastest NAS storage on the market and was acquired quickly by NetApp. The team includes experience with enterprise storage at Cisco, VMware, IBM and EMC. Our VP of Engineering led emerging technologies and a 600-person NAS engineering team for EMC. We also have experience in Business Intelligence and Analytic companies, and open source committers in Hadoop, ZooKeeper and Mahout, including PMC members. MapR is proven technology, with installs at leading Hadoop installations across industries and OEM relationships with EMC and Cisco.
  • MapR’s innovations also include expanding the standards-based interfaces. These innovations include comprehensive support for standard development tools, languages, and data access.
  • MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. We have spent over two years in a well-funded effort to provide deep architectural improvements and create the next-generation distribution for Hadoop. MapR has made significant updates while providing a 100% compatible distribution for Apache Hadoop. This is in stark contrast with the alternative distributions from Cloudera, HortonWorks, and Apache, which are all equivalent.
    1. 1. 1©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop
    Contact: – – @ted_dunning  Slides and such: –  Hash tag: #mapr_uk Collective notes:
    4. Company Background
       • MapR provides the industry’s best Hadoop Distribution
         – Combines the best of the Hadoop community contributions with significant internally financed infrastructure development
       • Background of Team
         – Deep management bench with extensive analytic, storage, virtualization, and open source experience
         – Google, EMC, Cisco, VMware, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
       • Proven
         – MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
         – Strategic OEM relationship with EMC and Cisco
         – Over 1,000 installs
    5. Expanding Hadoop Use Cases
       [Diagram labels: NFS for file-based applications; Hadoop APIs for Hadoop Applications; ODBC (JDBC) for SQL-based applications; Real-time Applications; Mission Critical and SLA dependent Applications. Legend: Blue = MapR Innovations]
    6. MapR’s Complete Distribution for Apache Hadoop
       [Diagram labels: MapR Heatmap™; LDAP, NIS Integration; Quotas, Alerts, Alarms; CLI, REST API; Hive; Pig; Oozie; Sqoop; HBase; Whirr; Mahout; Cascading; Nagios Integration; Ganglia Integration; Flume; ZooKeeper; MapR Control System; Direct Access NFS; Real-Time Streaming; Volumes; Mirrors; Snapshots; Data Placement; No NameNode Architecture; High Performance Direct Shuffle; Stateful Failover and Self Healing; MapR’s Storage Services™]
       • Integrated, tested, hardened and supported
       • 100% Hadoop, HBase, HDFS API compatible
       • Easy portability/migration between distributions
       • Unique advanced features
       • No changes required to Hadoop applications
       • Runs on commodity hardware
    7. So what about that real-time stuff?
    8. The Challenge
       • Hadoop is great for processing vats of data
         – But sucks for real-time (by design!)
       • Storm is great for real-time processing
         – But lacks any way to deal with batch processing
       • It sounds like there isn’t a solution
         – Neither fashionable solution handles everything
    9. This is not a problem. It’s an opportunity!
    10. Hadoop is Not Very Real-time
        [Figure: timeline along t ending at "now". Labels: Unprocessed Data; Fully processed; Latest full period; "Hadoop job takes this long for this data"]
    11. Need to Plug the Hole in Hadoop
        • We have real-time data with limited state
          – Exactly what Storm does
          – And what Hadoop does not
        • We also have long-term analytics with lots of state
          – Exactly what Hadoop does
          – And what Storm does not
        • Can Storm and Hadoop be combined?
    12. Real-time and Long-time together
        [Figure: timeline along t ending at "now". Labels: "Hadoop works great back here"; "Storm works here"; "Blended view" (three times)]
    13. An Example
        • I want to know how many queries I get
          – Per second, minute, day, week
        • Results should be available
          – within <2 seconds 99.9+% of the time
          – within 30 seconds almost always
        • History should last >3 years
        • Should work for 0.001 q/s up to 100,000 q/s
        • Failure tolerant, yadda, yadda
    14. Rough Design – Data Flow
        [Diagram components: Search Engine; Query Event Spout (×2); Logger Bolt (×3); Counter Bolt (×2); Raw Logs; Semi Agg; Hadoop Aggregator; Snap; Long agg]
    15. Counter Bolt Detail
        • Input: Labels to count
        • Output: Short-term semi-aggregated counts
          – (time-window, label, count)
        • Input is logged until next flush
        • Non-zero counts emitted on flush if
          – event count reaches threshold (typical 100K)
          – time since last count reaches threshold (typical 1-10s)
        • Tuples acked when counts emitted
        • Double count probability is > 0 but very small
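The flush rule on this slide can be sketched in a few lines of Python. This is only an illustration, not Storm code: the class name `SemiAggregatingCounter`, its parameters, and the injectable `clock` are all assumptions made for the example. It also leaves out the logging of input that the slide requires before a flush.

```python
import time
from collections import Counter


class SemiAggregatingCounter:
    """Hypothetical stand-in for the counter bolt; this is not the Storm API."""

    def __init__(self, window_s=60, max_events=100_000, max_age_s=1.0, clock=time.time):
        self.window_s = window_s        # length of a time window in seconds
        self.max_events = max_events    # flush after this many buffered events
        self.max_age_s = max_age_s      # flush after this long since the last flush
        self.clock = clock
        self.counts = Counter()         # (window_start, label) -> count
        self.pending = []               # tuple ids that still await an ack
        self.last_flush = clock()

    def add(self, tuple_id, label, event_time):
        """Count one label; return the flushed output if a threshold was hit, else None."""
        window_start = int(event_time // self.window_s) * self.window_s
        self.counts[(window_start, label)] += 1
        self.pending.append(tuple_id)
        too_many = len(self.pending) >= self.max_events
        too_old = self.clock() - self.last_flush >= self.max_age_s
        return self.flush() if (too_many or too_old) else None

    def flush(self):
        """Return (time-window, label, count) records and the tuple ids that may now be acked."""
        records = [(w, label, n) for (w, label), n in sorted(self.counts.items())]
        acked = self.pending
        self.counts = Counter()
        self.pending = []
        self.last_flush = self.clock()
        return records, acked
```

Because the counter resets on every flush, the same (time-window, label) pair can be emitted many times. Those repeated records are the semi-aggregates that a later stage has to sum.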
    16. Counter Bolt Counterintuitivity
        • Counts are emitted for same label, same time window many times
          – these are semi-aggregated
          – this is a feature
          – tuples can be acked within 1s
          – time windows can be much longer than 1s
        • No need to send same label to same bolt
          – speeds failure recovery
    17. Design Flexibility
        • Tuples can be ack’ed as soon as they hit the log
          – counter can recover state on failure
          – log is burn after write
        • Count flush interval can be extended without extending tuple timeout
          – Decreases currency of counts in semi-aggregates
        • Total bandwidth for log is typically not huge
          – All of Twitter @ 10,000 messages per second = 10K × 2 KB = 20 MB/s
    18. Counter Bolt No-nos
        • Cannot accumulate entire period in-memory
          – Tuples must be ack’ed much sooner
          – State must be persisted before ack’ing
          – State can easily grow too large to handle without disk access
        • Cannot persist entire count table at once
          – Incremental persistence required
    19. Guarantees
        • Counter output volume is small-ish
          – the greater of k tuples per 100K inputs or k tuple/s
          – 1 tuple/s/label/bolt for this exercise
        • Persistence layer must provide guarantees
          – distributed against node failure
          – must have either readable flush or closed-append
        • HDFS is distributed, but provides no guarantees and strange semantics
        • MapRfs is distributed, provides all necessary guarantees
    20. Failure Modes
        • Bolt failure
          – buffered tuples will go un-acked
          – after timeout, tuples will be resent
          – timeout ≈ 10s
          – if failure occurs after persistence, before acking, then double-counting is possible
        • Storage (with MapR)
          – most failures invisible
          – a few continue within 0-2s, some take 10s
          – catastrophic cluster restart can take 2-3 min
          – logger can buffer this much easily
    21. Presentation Layer
        • Presentation must
          – read recent output of Logger bolt
          – read relevant output of Hadoop jobs
          – combine semi-aggregated records
        • User will see
          – counts that increment within 0-2 s of events
          – seamless meld of short and long-term data
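One way to picture the "combine" step is the sketch below. The function name `blend_counts`, the dictionary layout, and the `boundary` argument are assumptions for illustration. In particular, the sketch assumes the batch side can report the first time window it has not yet aggregated, so that recent records for older windows are not counted twice.

```python
from collections import defaultdict


def blend_counts(long_term, recent, boundary):
    """Merge batch aggregates with newer semi-aggregated records.

    long_term: dict mapping (window_start, label) -> count, as produced by batch jobs
    recent:    iterable of (window_start, label, count) semi-aggregated records
    boundary:  first window_start not yet covered by a batch job (assumed to be known)
    """
    totals = defaultdict(int)
    for key, count in long_term.items():
        totals[key] += count
    for window_start, label, count in recent:
        # Windows before the boundary are already in the batch output; skip them.
        if window_start >= boundary:
            totals[(window_start, label)] += count
    return dict(totals)
```

The same (window, label) pair may appear several times in `recent`; summing the duplicates is what turns semi-aggregates into the counts the user sees.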
    22. Example 2 – Real-time learning
        • My system has to
          – learn a response model and
          – select training data
          – in real-time
        • Data rate up to 100K queries per second
    23. Door Number 3 – AB testing in real-time
        • I have 15 versions of my landing page
        • Each visitor is assigned to a version
          – Which version?
        • A conversion or sale or whatever can happen
          – How long to wait?
        • Some versions of the landing page are horrible
          – Don’t want to give them traffic
    24. Real-time Constraints
        • Selection must happen in <20 ms almost all the time
        • Training events must be handled in <20 ms
        • Failover must happen within 5 seconds
        • Client should time out and back off
          – no need for an answer after 500 ms
        • State persistence required
    25. Rough Design
        [Diagram components: DRPC Spout; Query Event Spout; Logger Bolt (×2); Counter Bolt; Raw Logs; Model State; Timed Join; Model; Conversion Detector; Selector Layer]
    26. A Quick Diversion
        • You see a coin
          – What is the probability of heads?
          – Could it be larger or smaller than that?
        • I flip the coin and while it is in the air ask again
        • I catch the coin and ask again
        • I look at the coin (and you don’t) and ask again
        • Why does the answer change?
          – And did it ever have a single value?
    27. A First Conclusion
        • Probability as expressed by humans is subjective and depends on information and experience
    28. A Second Diversion
        • What is the mass of the moon?
          – 1/2 degree @ 385 Mm = ~3.8 Mm diameter (really about 3.4-ish)
          – $V = \tfrac{1}{6} \pi \times 3.8^3 \times 10^{18}\ \mathrm{m^3} \approx 29 \times 10^{18}\ \mathrm{m^3}$ (really about 22)
          – $m = \rho V = 4\ \mathrm{Mg/m^3} \times 29 \times 10^{18}\ \mathrm{m^3} = 1.2 \times 10^{23}\ \mathrm{kg}$ (really about 0.7)
        • Is that the exact number?
          – Shouldn’t we have confidence bounds?
        • Wikipedia says: $7.3477 \times 10^{22}$ kg
          – Is that the exact number?
          – Shouldn’t they have confidence bounds?
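The back-of-the-envelope arithmetic on this slide can be replayed as a short Python sketch. The function names are made up for the example; the inputs (385 Mm, half a degree, 3.8 Mm, 4 Mg/m^3) are the slide's own rough guesses, not measured values.

```python
import math


def small_angle_diameter(distance_m, angle_deg):
    """Diameter of a distant object from its apparent angular size (small-angle approximation)."""
    return distance_m * math.radians(angle_deg)


def sphere_mass(diameter_m, density_kg_m3):
    """Volume and mass of a uniform sphere, using V = (pi / 6) * d^3."""
    volume_m3 = math.pi / 6 * diameter_m ** 3
    return volume_m3, density_kg_m3 * volume_m3


# Slide inputs: 3.8 Mm diameter and a density guess of 4 Mg/m^3 (4000 kg/m^3).
volume_m3, mass_kg = sphere_mass(3.8e6, 4000.0)
print(f"V = {volume_m3:.3g} m^3, m = {mass_kg:.3g} kg")
print(f"d = {small_angle_diameter(385e6, 0.5):.3g} m")
```

With the slide's inputs this gives roughly 28.7 × 10^18 m^3 and 1.15 × 10^23 kg, matching the rounded figures on the slide. The small-angle product of 385 Mm and half a degree comes out near 3.36 Mm, which is closer to the "really about 3.4-ish" aside than to the 3.8 Mm carried forward. That is a small instance of the slide's point about single numbers hiding uncertainty.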
    29. A Second Conclusion
        • A single number is a bad way to express uncertain knowledge
        • A distribution of values might be better
    30. I Dunno
    31. 5 and 5
    32. 2 and 10
    33. Bayesian Bandit
        • Compute distributions based on data
        • Sample p1 and p2 from these distributions
        • Put a coin in bandit 1 if p1 > p2
        • Else, put the coin in bandit 2
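The four steps on this slide can be sketched in Python as follows. The sketch assumes each bandit pays out with a fixed probability and that the "distribution based on data" is a Beta distribution with a uniform prior, i.e. Beta(wins + 1, losses + 1). The function names and the simulated payout rates are hypothetical.

```python
import random


def choose_bandit(wins, losses, rng=random):
    """Sample one payout rate per bandit from Beta(wins + 1, losses + 1); play the largest."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=samples.__getitem__)


def simulate(true_rates, steps, seed=0):
    """Play `steps` coins against simulated bandits and return the win/loss counts."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(steps):
        i = choose_bandit(wins, losses, rng)
        if rng.random() < true_rates[i]:   # simulated payout, an assumption of this sketch
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses
```

With no data the two sampled rates are equally likely to win, so both bandits get played. As counts accumulate, the samples for the worse bandit rarely come out on top and it receives less and less traffic.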
    34. And it works!
        [Plot: regret (y axis, 0 to 0.12) versus n (x axis, 0 to 1100). Series: ε-greedy with ε = 0.05; Bayesian Bandit with Gamma-Normal]
    35. Video Demo
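    36. The Code
        • Select an alternative
              # The function header is missing from the transcript; the name `select`
              # is taken from the call in the loop further down.
              # k is an n x 2 matrix of counts, one row per alternative.
              # Assumption: column 1 counts failures and column 2 counts successes.
              select = function(k) {
                n = dim(k)[1]
                p0 = rep(0, length.out=n)
                for (i in 1:n) {
                  # sample a plausible success rate for alternative i
                  p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
                }
                return (which(p0 == max(p0)))
              }
        • Select and learn
              # The header is missing here too; the name `learn` and its arguments are assumed.
              # test(i) is not shown on the slide; it must return the column (1 or 2) to increment.
              learn = function(k, steps) {
                for (z in 1:steps) {
                  i = select(k)
                  j = test(i)
                  k[i,j] = k[i,j]+1
                }
                return (k)
              }
        • But we already know how to count!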
    37. The Basic Idea
        • We can encode a distribution by sampling
        • Sampling allows unification of exploration and exploitation
        • Can be extended to more general response models
