London hug

Talk given at the London Hadoop users' group meeting in May of 2012 about how to do real-time and batch computation on the same stream of information.

Usage Rights

© All Rights Reserved

  • MapR combines the best of open source technology with our own deep innovations to provide the most advanced distribution for Apache Hadoop. MapR’s team has a deep bench of enterprise software experience, with proven success across storage, networking, virtualization, analytics, and open source technologies. Our CEO has driven multiple companies to successful outcomes in the analytics, storage, and virtualization spaces. Our CTO and co-founder, M.C. Srivas, was most recently at Google working on BigTable, and he understands the challenges of MapReduce at huge scale. Srivas was also the chief software architect at Spinnaker Networks, which came out of stealth with the fastest NAS storage on the market and was quickly acquired by NetApp. The team includes experience with enterprise storage at Cisco, VMware, IBM, and EMC. Our VP of Engineering led emerging technologies and a 600-person NAS engineering team at EMC. We also have experience from business intelligence and analytics companies, and open source committers in Hadoop, ZooKeeper, and Mahout, including PMC members. MapR is proven technology, installed at leading Hadoop sites across industries and OEMed by EMC and Cisco.
  • MapR’s innovations also include expanding the standards-based interfaces, with comprehensive support for standard development tools, languages, and data access.
  • MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested, and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. We have spent more than two years in a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. MapR has made significant updates while remaining 100% compatible with Apache Hadoop. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent.

London hug Presentation Transcript

  • Real-time and Long-time with Storm and Hadoop
      © MapR Technologies - Confidential
  • Contact:
      – tdunning@maprtech.com
      – @ted_dunning
    Slides and such:
      – http://info.mapr.com/ted-uk-05-2012
    Hash tag: #mapr_uk
    Collective notes: http://bit.ly/JDCRhc
  • Company Background MapR provides the industry’s best Hadoop Distribution – Combines the best of the Hadoop community contributions with significant internally financed infrastructure development Background of Team – Deep management bench with extensive analytic, storage, virtualization, and open source experience – Google, EMC, Cisco, VMWare, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel Proven – MapR used across industries (Financial Services, Media, Telcom, Health Care, Internet Services, Government) – Strategic OEM relationship with EMC and Cisco – Over 1,000 installs ©MapR Technologies - Confidential 4
  • Expanding Hadoop Use Cases
      [Diagram. Boxes:]
      • Hadoop APIs for Hadoop applications
      • NFS for file-based applications
      • ODBC (JDBC) for SQL-based applications
      • Mission-critical and SLA-dependent applications
      • Real-time applications
      Legend: Blue = MapR Innovations
  • MapR’s Complete Distribution for Apache Hadoop
      [Diagram. Recoverable labels:]
      • Integrated, tested, hardened and supported
      • MapR Control System
          – MapR Heatmap™
          – LDAP, NIS integration
          – Quotas, alerts, alarms
          – CLI, REST API
      • 100% Hadoop, HBase, HDFS API compatible
          – Hive, Pig, Oozie, Sqoop, HBase, Whirr, Mahout, Cascading, Nagios integration, Ganglia integration, Flume, ZooKeeper
      • Easy portability/migration between distributions
      • Unique advanced features; no changes required to Hadoop applications
          – Direct Access NFS, Real-Time Streaming, Volumes, Mirrors, Snapshots, Data Placement
      • Runs on commodity hardware
      • MapR’s Storage Services™
          – No NameNode Architecture
          – High Performance Direct Shuffle
          – Stateful Failover and Self Healing
  • So what about that real-time stuff?
  • The Challenge Hadoop is great of processing vats of data – But sucks for real-time (by design!) Storm is great for real-time processing – But lacks any way to deal with batch processing It sounds like there isn’t a solution – Neither fashionable solution handles everything©MapR Technologies - Confidential 8
  • This is not a problem. It’s an opportunity!
  • Hadoop is Not Very Real-time
      [Figure: timeline t ending at "now". Labels: "Fully processed", "Latest full period", "Unprocessed Data", "Hadoop job takes this long for this data"]
  • Need to Plug the Hole in Hadoop
      • We have real-time data with limited state
          – Exactly what Storm does
          – And what Hadoop does not
      • We also have long-term analytics with lots of state
          – Exactly what Hadoop does
          – And what Storm does not
      • Can Storm and Hadoop be combined?
  • Real-time and Long-time together
      [Figure: timeline t ending at "now". Labels: "Blended view", "Hadoop works great back here", "Storm works here"]
  • An Example I want to know how many queries I get – Per second, minute, day, week Results should be available – within <2 seconds 99.9+% of the time – within 30 seconds almost always History should last >3 years Should work for 0.001 q/s up to 100,000 q/s Failure tolerant, yadda, yadda©MapR Technologies - Confidential 13
  • Rough Design – Data Flow
      [Diagram. Components: Search Engine, Query Event Spout (x2), Counter Bolt (x2), Logger Bolt (x3), Hadoop Aggregator. Other labels: Raw Logs, Semi Agg, Snap, Long agg]
  • Counter Bolt Detail
      • Input: Labels to count
      • Output: Short-term semi-aggregated counts
          – (time-window, label, count)
      • Input is logged until next flush
      • Non-zero counts emitted on flush if
          – event count reaches threshold (typical 100K)
          – time since last count reaches threshold (typical 1-10s)
      • Tuples acked when counts emitted
      • Double count probability is > 0 but very small
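The flush rule on the Counter Bolt Detail slide can be sketched in Python. This is a minimal single-process illustration and not Storm code: the class name `SemiAggregatingCounter`, its parameters, and the injectable clock are assumptions made for the example, and acking is represented only by a comment marking where it would happen.

```python
import time
from collections import Counter


class SemiAggregatingCounter:
    """Counts labels and flushes partial (semi-aggregated) counts.

    Hypothetical sketch: names and defaults are illustrative, not from the talk.
    """

    def __init__(self, window_s=60, max_events=100_000, max_age_s=1.0, clock=time.time):
        self.window_s = window_s        # length of a time window in seconds
        self.max_events = max_events    # flush after this many buffered events
        self.max_age_s = max_age_s      # flush after this many seconds
        self.clock = clock              # injectable so the sketch is testable
        self.counts = Counter()         # (window_start, label) -> count
        self.pending = 0                # events buffered since the last flush
        self.last_flush = clock()

    def add(self, label, event_time):
        """Buffer one event; return emitted records if a flush was triggered."""
        window_start = int(event_time // self.window_s) * self.window_s
        self.counts[(window_start, label)] += 1
        self.pending += 1
        too_many = self.pending >= self.max_events
        too_old = self.clock() - self.last_flush >= self.max_age_s
        return self.flush() if (too_many or too_old) else []

    def flush(self):
        """Emit (time-window, label, count) records and reset the buffer.

        In a real bolt this is the point at which buffered tuples would be acked.
        """
        records = [(w, label, n) for (w, label), n in sorted(self.counts.items())]
        self.counts.clear()
        self.pending = 0
        self.last_flush = self.clock()
        return records
```

The sketch only checks the time threshold when an event arrives; a real bolt would also need a timer so that a quiet stream still flushes.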
  • Counter Bolt Counterintuitivity
      • Counts are emitted for same label, same time window many times
          – these are semi-aggregated
          – this is a feature
          – tuples can be acked within 1s
          – time windows can be much longer than 1s
      • No need to send same label to same bolt
          – speeds failure recovery
  • Design Flexibility
      • Tuples can be ack’ed as soon as they hit the log
          – counter can recover state on failure
          – log is burn after write
      • Count flush interval can be extended without extending tuple timeout
          – Decreases currency of counts in semi-aggregates
      • Total bandwidth for log is typically not huge
          – All of Twitter @ 10,000 messages per second = 10K x 2KB = 20MB/s
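The log-then-ack idea on the Design Flexibility slide can be illustrated as follows. This is a sketch under stated assumptions: a local append-only file stands in for the distributed log, labels contain no newlines, and `LoggedCounter`, its `ack` callback, and the reading of "burn after write" as "discard the log once counts are flushed" are all hypothetical choices for the example.

```python
import os
from collections import Counter


class LoggedCounter:
    """Appends each label to a log before acking, and replays the log on restart."""

    def __init__(self, log_path, ack):
        self.log_path = log_path
        self.ack = ack                  # callback standing in for a tuple ack
        self.counts = Counter()
        self._replay()                  # recover any un-flushed state
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild in-memory counts from whatever the log still holds."""
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    self.counts[line.rstrip("\n")] += 1

    def add(self, tuple_id, label):
        """Persist the label first, then count it and ack the tuple."""
        self.log.write(label + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())     # state is on disk before the ack
        self.counts[label] += 1
        self.ack(tuple_id)

    def flush(self):
        """Return the counts so far, then discard the log and start a new one."""
        out = dict(self.counts)
        self.counts.clear()
        self.log.close()
        os.remove(self.log_path)
        self.log = open(self.log_path, "a", encoding="utf-8")
        return out

    def close(self):
        self.log.close()
```

Because the ack happens only after the write is synced, a crash between the two leaves the tuple un-acked and it will be resent, which is where the small double-count probability comes from.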
  • Counter Bolt No-nos
      • Cannot accumulate entire period in-memory
          – Tuples must be ack’ed much sooner
          – State must be persisted before ack’ing
          – State can easily grow too large to handle without disk access
      • Cannot persist entire count table at once
          – Incremental persistence required
  • Guarantees Counter output volume is small-ish – the greater of k tuples per 100K inputs or k tuple/s – 1 tuple/s/label/bolt for this exercise Persistence layer must provide guarantees – distributed against node failure – must have either readable flush or closed-append HDFS is distributed, but provides no guarantees and strange semantics MapRfs is distributed, provides all necessary guarantees©MapR Technologies - Confidential 19
  • Failure Modes Bolt failure – buffered tuples will go un’acked – after timeout, tuples will be resent – timeout ≈ 10s – if failure occurs after persistence, before acking, then double-counting is possible Storage (with MapR) – most failures invisible – a few continue within 0-2s, some take 10s – catastrophic cluster restart can take 2-3 min – logger can buffer this much easily©MapR Technologies - Confidential 20
  • Presentation Layer
      • Presentation must
          – read recent output of Logger bolt
          – read relevant output of Hadoop jobs
          – combine semi-aggregated records
      • User will see
          – counts that increment within 0-2 s of events
          – seamless meld of short and long-term data
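Combining semi-aggregated records is just addition, which the following Python sketch tries to make concrete. The function name `blended_counts`, the record layout `(window_start, label, count)` in seconds, and the assumption that the two inputs do not overlap (the recent records cover only what the batch job has not yet processed) are illustrative choices, not details given in the talk.

```python
from collections import defaultdict


def blended_counts(long_term, recent, bucket_s):
    """Sum semi-aggregated (window_start, label, count) records into buckets.

    long_term: records produced by the batch (Hadoop) side
    recent:    records from the logger output not yet covered by a batch job
    bucket_s:  bucket width in seconds, e.g. 60 for per-minute counts
    """
    totals = defaultdict(int)
    for window_start, label, count in list(long_term) + list(recent):
        bucket = (window_start // bucket_s) * bucket_s
        totals[(bucket, label)] += count
    return dict(totals)


if __name__ == "__main__":
    long_term = [(0, "q", 100), (60, "q", 40)]
    recent = [(120, "q", 3), (120, "q", 2), (125, "q", 1)]
    print(blended_counts(long_term, recent, 60))
    print(blended_counts(long_term, recent, 3600))
```

Because the same label and window may appear many times in the input, the order of records and the bolt that produced them do not matter; only the sum does.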
  • Example 2 – Real-time learning
      • My system has to
          – learn a response model and
          – select training data
          – in real-time
      • Data rate up to 100K queries per second
  • Door Number 3 – AB testing in real-time
      • I have 15 versions of my landing page
      • Each visitor is assigned to a version
          – Which version?
      • A conversion or sale or whatever can happen
          – How long to wait?
      • Some versions of the landing page are horrible
          – Don’t want to give them traffic
  • Real-time Constraints
      • Selection must happen in <20 ms almost all the time
      • Training events must be handled in <20 ms
      • Failover must happen within 5 seconds
      • Client should time out and back off
          – no need for an answer after 500 ms
      • State persistence required
  • Rough Design
      [Diagram. Components: Selector Layer, DRPC Spout, Query Event Spout, Timed Join, Model, Counter Bolt, Conversion Detector, Logger Bolt (x2). Other labels: Model State, Raw Logs]
  • A Quick Diversion
      • You see a coin
          – What is the probability of heads?
          – Could it be larger or smaller than that?
      • I flip the coin and while it is in the air ask again
      • I catch the coin and ask again
      • I look at the coin (and you don’t) and ask again
      • Why does the answer change?
          – And did it ever have a single value?
  • A First Conclusion
      • Probability as expressed by humans is subjective and depends on information and experience
  • A Second Diversion
      • What is the mass of the moon?
          – 1/2 degree @ 385 Mm = ~ 3.8 Mm diameter (really about 3.4-ish)
          – V = 1/6 x pi x 3.8^3 x 10^18 m^3 = ~ 29 x 10^18 m^3 (really about 22)
          – m = rho V = 4 Mg/m^3 x 29 x 10^18 m^3 = 1.2 x 10^23 kg (really about 0.7)
      • Is that the exact number?
          – Shouldn’t we have confidence bounds?
      • Wikipedia says: 7.3477 × 10^22 kg
          – Is that the exact number?
          – Shouldn’t they have confidence bounds?
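The back-of-envelope estimate can be replayed in a few lines of Python. This sketch only re-does the slide's arithmetic with the slide's own inputs; the variable names are made up for the example, and the small-angle formula (diameter = distance x angle in radians) is an assumption about how the first step was meant.

```python
import math

ANGULAR_DIAMETER_DEG = 0.5      # apparent size of the moon, from the slide
DISTANCE_M = 385e6              # 385 Mm, from the slide
DENSITY_KG_M3 = 4000.0          # 4 Mg/m^3, from the slide


def sphere_volume(diameter_m):
    """Volume of a sphere in terms of its diameter: pi * d^3 / 6."""
    return math.pi * diameter_m ** 3 / 6


# Small-angle approximation: diameter = distance * angle in radians.
diameter_from_angle = DISTANCE_M * math.radians(ANGULAR_DIAMETER_DEG)

# The slide rounds the diameter to 3.8 Mm and carries that value forward.
slide_diameter = 3.8e6
slide_volume = sphere_volume(slide_diameter)
slide_mass = DENSITY_KG_M3 * slide_volume

print(f"diameter from angle: {diameter_from_angle / 1e6:.2f} Mm")
print(f"volume (d = 3.8 Mm): {slide_volume:.2e} m^3")
print(f"mass   (d = 3.8 Mm): {slide_mass:.2e} kg")
```

The small-angle step lands near the "really about 3.4-ish" figure rather than 3.8 Mm, and because volume goes as the cube of the diameter, that modest difference is most of the gap between the estimated and quoted mass.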
  • A Second Conclusion
      • A single number is a bad way to express uncertain knowledge
      • A distribution of values might be better
  • I Dunno
  • 5 and 5
  • 2 and 10
  • Bayesian Bandit Compute distributions based on data Sample p1 and p2 from these distributions Put a coin in bandit 1 if p1 > p2 Else, put the coin in bandit 2©MapR Technologies - Confidential 33
  • And it works!
      [Plot: regret (y-axis, 0 to 0.12) versus n (x-axis, 0 to 1100). Series: ε-greedy with ε = 0.05; Bayesian Bandit with Gamma-Normal]
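A comparison of this kind can be reproduced in miniature with the Python sketch below. It is not the experiment behind the plot: it assumes Bernoulli arms with a Beta posterior rather than the Gamma-Normal model named on the slide, the payout rates are made up, and "regret" here means the average gap between the best arm's true rate and the pulled arm's true rate.

```python
import random


def run(policy, rates, steps, rng):
    """Return the average per-step regret of `policy` on Bernoulli arms."""
    wins = [0] * len(rates)
    pulls = [0] * len(rates)
    best = max(rates)
    regret = 0.0
    for _ in range(steps):
        arm = policy(wins, pulls, rng)
        pulls[arm] += 1
        wins[arm] += rng.random() < rates[arm]
        regret += best - rates[arm]
    return regret / steps


def epsilon_greedy(wins, pulls, rng, eps=0.05):
    """Explore uniformly with probability eps, otherwise pull the best observed arm."""
    if rng.random() < eps:
        return rng.randrange(len(pulls))
    # Untried arms get an optimistic mean of 1.0 so each is tried at least once.
    means = [w / p if p else 1.0 for w, p in zip(wins, pulls)]
    return max(range(len(pulls)), key=means.__getitem__)


def bayesian_bandit(wins, pulls, rng):
    """Sample each arm's rate from its Beta posterior and pull the largest sample."""
    samples = [rng.betavariate(w + 1, p - w + 1) for w, p in zip(wins, pulls)]
    return max(range(len(pulls)), key=samples.__getitem__)


if __name__ == "__main__":
    rates = [0.1, 0.5]                  # hypothetical payout rates
    for name, policy in [("epsilon-greedy", epsilon_greedy),
                         ("Bayesian bandit", bayesian_bandit)]:
        print(name, run(policy, rates, 2000, random.Random(1)))
```

ε-greedy keeps spending a fixed fraction of pulls on exploration forever, while the sampling approach explores less as its posteriors sharpen, which is the behaviour the regret curves are meant to show.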
  • Video Demo
  • The Code Select an alternative n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0))) Select and learn for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k) But we already know how to count!©MapR Technologies - Confidential 36
  • The Basic Idea
      • We can encode a distribution by sampling
      • Sampling allows unification of exploration and exploitation
      • Can be extended to more general response models
  • Contact:
      – tdunning@maprtech.com
      – @ted_dunning
    Slides and such:
      – http://info.mapr.com/ted-uk-05-2012
  • MapR’s Innovations
  • Thank You