Real-time and Long-time Together



A talk that Ted Dunning gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.

    Real-time and Long-time Together: Presentation Transcript

    • Slide 1: Real-time and Long-time with Storm and Hadoop (© MapR Technologies - Confidential)
    • Slide 2: Real-time and Long-time with Storm and Hadoop (MapR)
    • Slide 3
        • Contact:
            – tdunning@maprtech.com
            – @ted_dunning
        • Slides and such (available late tonight):
            – http://info.mapr.com/ted-pjug
        • Hash tags: #mapr #pjug
    • Slide 4: The Challenge
        • Hadoop is great at processing vats of data
            – But sucks for real-time (by design!)
        • Storm is great for real-time processing
            – But lacks any way to deal with batch processing
        • It sounds like there isn’t a solution
            – Neither fashionable solution handles everything
    • Slide 5: This is not a problem. It’s an opportunity!
    • Slide 6: What is Map-Reduce?
        • Map-reduce programs are defined (mostly) by
            – A map function that does independent record transformations (and deletions and replications)
            – Reduce functions that do aggregation
        • Map-reduce programs run in a framework that
            – Schedules and re-runs tasks
            – Splits the input
            – Moves map outputs to reduce inputs
            – Receives the results
    • Slide 7: Inside Map-Reduce
        • Stages: Input → Map → Combine → Shuffle and sort → Reduce → Output
        • Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
        • Map output: the, 1; time, 1; has, 1; come, 1; …
        • Shuffle and sort output: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …
        • Reduce output: come, 6; has, 8; the, 4; time, 14; …
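The word-count flow on slide 7 can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce stages, not Hadoop code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this sketch, and the combine stage is left out for brevity.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        word = word.strip('.,:;"\'!?')
        if word:
            yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key, in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce: aggregate the list of values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


lines = ['"The time has come," the Walrus said,', '"To talk of many things:']
counts = word_count(lines)
print(counts)
```

In a real framework the map and reduce functions run on different machines and the shuffle moves data between them; here everything happens in one process so the data flow is easy to follow.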
    • Slide 8: Not Just Text
        • Counting words is easy
        • Many other problems work as well
            – Sessionize user logs
            – Very large scale joins
            – Large scale matrix recommendations
            – Computing the quadrillionth digit of π
        • Map-reduce is inherently batch oriented
    • Slide 9: Inside Map-Reduce
        • Stages: Input → Map → Shuffle and sort → Reduce → Output
        • Input: road1, polyline(p1, p2, p3, …); lake1, polygon(p4, p5, p7, p9, …); road2, polyline(p6, p7, p9, …)
        • Map output: tile0918-1412, road1; tile1082-8143, road1; tile0014-3284, lake1; tile1082-8143, lake1; …
        • Shuffle and sort output: tile0918-1412, [road1]; tile1082-8143, [road1, lake1]; tile0014-3284, [lake1]; …
        • Reduce output: tile1082-8143, img#1; tile0014-3284, img#2; tile1082-8143, img#3; …
    • Slide 10: What is Storm
        • A Storm program is called a topology
            – Spouts inject data into a topology
            – Bolts process data
        • The units of data are called tuples
        • All processing is flow-through
        • Bolts can buffer or persist
        • Output tuples can be anchored
        • Bolts that fail are restarted and un-acked tuples are replayed
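The ack-and-replay behavior described on slide 10 can be illustrated with a toy model. The sketch below is plain Python and does not use Storm's API; `ToySpout`, `run_topology`, and `flaky_upper` are hypothetical names, and the "failure" is simulated by raising an exception the first time one tuple is processed.

```python
class ToySpout:
    """Hypothetical stand-in for a spout: emits tuples and tracks un-acked ones."""

    def __init__(self, records):
        self.queue = list(enumerate(records))  # (message id, payload)
        self.pending = {}                      # emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.pop(0)
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Put the un-acked tuple back on the queue so it is replayed later.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run_topology(spout, bolt):
    """Drive tuples through a single bolt, acking successes and replaying failures."""
    results = []
    while True:
        item = spout.next_tuple()
        if item is None:
            break
        msg_id, payload = item
        try:
            results.append(bolt(payload))
            spout.ack(msg_id)
        except RuntimeError:
            spout.fail(msg_id)
    return results


attempts = {}


def flaky_upper(word):
    """A bolt that fails the first time it sees "b", then succeeds on replay."""
    attempts[word] = attempts.get(word, 0) + 1
    if word == "b" and attempts[word] == 1:
        raise RuntimeError("simulated bolt failure")
    return word.upper()


spout = ToySpout(["a", "b", "c"])
out = run_topology(spout, flaky_upper)
print(out, attempts)
```

Every tuple is eventually processed, but the replayed tuple arrives out of order, which is one reason downstream aggregation should not depend on arrival order.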
    • Slide 11: Hadoop is Not Very Real-time
        • [Timeline diagram: time axis t ending at "now"; regions labeled "Unprocessed Data", "Fully processed", and "Latest full period"; annotation "Hadoop job takes this long for this data"]
    • Slide 12: Real-time and Long-time together
        • [Timeline diagram: time axis t ending at "now"; annotations "Hadoop works great back here", "Storm works here", and "Blended view"]
    • Slide 13: One Alternative
        • [Diagram: elements labeled "Search Engine", "NoSQL du Jour", "Consumer", "Real-time", "Long-time", and "?"]
    • Slide 14: Problems
        • Simply dumping into a NoSQL engine doesn’t quite work
        • Insert rate is limited
        • No load isolation
            – Big retrospective jobs kill real-time
        • Low scan performance
            – HBase pretty good, but not stellar
        • Difficult to set boundaries
            – where does real-time end and long-time begin?
    • Slide 15: Rough Design – Data Flow
        • [Data-flow diagram: components labeled "Search Engine", "Query Event Spout", "Logger Bolt", "Counter Bolt", "Raw Logs", "Semi Agg", "Hadoop Aggregator", "Snap", and "Long agg"]
    • Slide 16: Closer Look
        • Critical design goals:
            – fast ack for all tuples
            – fast restart of counter
        • Ack happens when tuple hits the replay log (100’s of milliseconds)
        • Restart involves replaying semi-aggs + replay log (very fast)
        • [Diagram: elements labeled "Incoming records", "Counter Bolt", "Replay Log", "Semi-aggregated records", "Real-time", and "Long-time"]
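A minimal sketch of the counter design on slide 16, under stated assumptions: the replay log and the semi-aggregated records are modeled as ordinary Python lists standing in for durable storage, and the class name `CounterBolt`, its `flush_every` parameter, and the flush policy are all hypothetical. The slides do not show the actual implementation.

```python
from collections import Counter


class CounterBolt:
    """Toy counter: ack once a tuple is in the replay log, flush partial counts periodically."""

    def __init__(self, replay_log, semi_aggs, flush_every=3):
        self.replay_log = replay_log    # stand-in for a durable append-only log
        self.semi_aggs = semi_aggs      # stand-in for durable semi-aggregated records
        self.flush_every = flush_every  # hypothetical flush policy: every N tuples
        # Restart path: rebuild state from the semi-aggregates, then the replay log.
        self.totals = Counter()
        for partial in self.semi_aggs:
            self.totals.update(partial)
        self.unflushed = Counter(self.replay_log)
        self.totals.update(self.unflushed)

    def process(self, label):
        self.replay_log.append(label)   # the tuple is recoverable from this point on
        self.unflushed[label] += 1
        self.totals[label] += 1
        if len(self.replay_log) >= self.flush_every:
            self.flush()
        return "ack"

    def flush(self):
        # A real system must make these two steps atomic; this sketch ignores that.
        self.semi_aggs.append(dict(self.unflushed))
        self.replay_log.clear()
        self.unflushed.clear()


log, aggs = [], []
bolt = CounterBolt(log, aggs)
for label in ["a", "b", "a", "a", "b"]:
    bolt.process(label)

# Simulate a crash: a new bolt sees only what was persisted.
restarted = CounterBolt(log, aggs)
print(restarted.totals)
```

Because the replay log only ever holds the tuples since the last flush, the restart work stays small no matter how long the counter has been running.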
    • Slide 17: A Frozen Moment in Time
        • Snapshot defines the dividing line
        • All data in the snap is long-time, all after is real-time
        • Semi-agg strategy allows clean query
        • [Diagram: elements labeled "Semi Agg", "Hadoop Aggregator", "Snap", and "Long agg"]
    • Slide 18: Guarantees
        • Counter output volume is small-ish
            – the greater of k tuples per 100K inputs or k tuples/s
            – 1 tuple/s/label/bolt for this exercise
        • Persistence layer must provide guarantees
            – distributed against node failure
            – must have either readable flush or closed-append
        • HDFS is distributed, but provides no guarantees and strange semantics
        • MapRfs is distributed, provides all necessary guarantees
    • Slide 19: Presentation Layer
        • Presentation must
            – read recent output of Logger bolt
            – read relevant output of Hadoop jobs
            – combine semi-aggregated records
        • User will see
            – counts that increment within 0-2 s of events
            – seamless meld of short and long-term data
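The blended query described on slides 17 and 19 might look roughly like the following. This is a sketch under assumptions: the record layout (`time`, `label`, `count`), the function name `blended_counts`, and the sample numbers are all invented for illustration; the slides only say that the snapshot is the dividing line and that semi-aggregated records are combined with the Hadoop output.

```python
from collections import Counter


def blended_counts(long_agg, semi_aggs, snapshot_time):
    """Combine the long-time aggregate with semi-aggregates newer than the snapshot."""
    total = Counter(long_agg)  # everything up to the snapshot, from the batch job
    for record in semi_aggs:
        # Records at or before the snapshot are already inside long_agg.
        if record["time"] > snapshot_time:
            total[record["label"]] += record["count"]
    return total


long_agg = {"page_a": 1000, "page_b": 400}
semi_aggs = [
    {"time": 95, "label": "page_a", "count": 7},   # before the snapshot, skipped
    {"time": 101, "label": "page_a", "count": 3},
    {"time": 102, "label": "page_c", "count": 1},
]
view = blended_counts(long_agg, semi_aggs, snapshot_time=100)
print(view)
```

Using the snapshot time as a strict cut-off is what keeps the two sources from double counting.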
    • Slide 20: Example 2 – A/B testing in real-time
        • I have 15 versions of my landing page
        • Each visitor is assigned to a version
            – Which version?
        • A conversion or sale or whatever can happen
            – How long to wait?
        • Some versions of the landing page are horrible
            – Don’t want to give them traffic
    • Slide 21: A Quick Diversion
        • You see a coin
            – What is the probability of heads?
            – Could it be larger or smaller than that?
        • I flip the coin and while it is in the air ask again
        • I catch the coin and ask again
        • I look at the coin (and you don’t) and ask again
        • Why does the answer change?
            – And did it ever have a single value?
    • Slide 22: A Philosophical Conclusion
        • Probability as expressed by humans is subjective and depends on information and experience
    • Slide 23: I Dunno
    • Slide 24: 5 heads out of 10 throws
    • Slide 25: 2 heads out of 12 throws
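Slides 23 to 25 appear to show how belief about a coin changes with data. One common way to express that, assumed here rather than taken from the slides, is a Beta posterior over the probability of heads starting from a uniform prior (the same "+1" that shows up in the `rbeta` call on slide 29). The helper name `beta_posterior` is made up for this sketch.

```python
import math


def beta_posterior(heads, throws):
    """Posterior over P(heads) after the given data, assuming a uniform Beta(1, 1) prior."""
    a = heads + 1
    b = throws - heads + 1
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, sd


# No data ("I Dunno"), then the two cases from the slides.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    a, b, mean, sd = beta_posterior(heads, throws)
    print(f"{heads}/{throws}: Beta({a}, {b}), mean={mean:.3f}, sd={sd:.3f}")
```

With no data the distribution is flat; as throws accumulate it narrows and shifts toward the observed frequency.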
    • Slide 26: Bayesian Bandit
        • Compute distributions based on data
        • Sample p1 and p2 from these distributions
        • Put a coin in bandit 1 if p1 > p2
        • Else, put the coin in bandit 2
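The four steps on slide 26 can be sketched as a short simulation. The assumptions are loud ones: payouts are modeled as win/lose with fixed hidden rates, the distributions are Beta posteriors with a uniform prior, and the names `choose_bandit` and `simulate` plus the rates 0.1 and 0.5 are invented for illustration.

```python
import random


def choose_bandit(wins, losses, rng):
    """Sample one plausible payout rate per bandit and pick the largest sample."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return samples.index(max(samples))


def simulate(true_rates, steps, seed=0):
    """Play the bandits for a number of steps, updating counts after every coin."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(steps):
        i = choose_bandit(wins, losses, rng)
        if rng.random() < true_rates[i]:  # hidden rate, unknown to the player
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses


wins, losses = simulate([0.1, 0.5], steps=2000)
pulls = [w + l for w, l in zip(wins, losses)]
print(pulls)
```

Early on both bandits get tried because their distributions are wide; as evidence accumulates, the weaker bandit's samples rarely win and it receives little traffic.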
    • Slide 27: And it works!
        • [Plot: regret (y-axis, 0 to 0.12) versus n (x-axis, 0 to 1100) for two strategies: ε-greedy with ε = 0.05, and Bayesian Bandit with Gamma-Normal]
    • Slide 28: Video Demo
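    • Slide 29: The Code
        • Select an alternative:
              # body of select(k); k holds one row of counts per alternative
              n = dim(k)[1]
              p0 = rep(0, length.out=n)
              for (i in 1:n) {
                # draw one sample from the Beta distribution for alternative i
                p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
              }
              return (which(p0 == max(p0)))
        • Select and learn:
              for (z in 1:steps) {
                i = select(k)        # pick an alternative by sampling
                j = test(i)          # try it and get the outcome column
                k[i,j] = k[i,j]+1    # count the outcome
              }
              return (k)
        • But we already know how to count!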
    • Slide 30: The Basic Idea
        • We can encode a distribution by sampling
        • Sampling allows unification of exploration and exploitation
        • Can be extended to more general response models
    • Slide 31
        • Contact:
            – tdunning@maprtech.com
            – @ted_dunning
        • Slides and such (available late tonight):
            – http://info.mapr.com/ted-pjug
        • Hash tags: #mapr #pjug
    • Slide 32: Thank You