Real-time and long-time together


Published on

A talk that I gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.

Published in: Technology, Business
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Real-time and long-time together

  1. 1. 1©MapR Technologies - ConfidentialReal-time and Long-time withStorm and Hadoop
  2. 2. 2©MapR Technologies - ConfidentialReal-time and Long-time withStorm and Hadoop MapR
  3. 3. 3©MapR Technologies - Confidential Contact:–– @ted_dunning Slides and such (available late tonight):– Hash tags: #mapr #storm
  4. 4. 4©MapR Technologies - ConfidentialThe Challenge Hadoop is great of processing vats of data– But sucks for real-time (by design!) Storm is great for real-time processing– But lacks any way to deal with batch processing It sounds like there isn’t a solution– Neither fashionable solution handles everything
  5. 5. 5©MapR Technologies - ConfidentialThis is not a problem.It’s an opportunity!
  6. 6. 6©MapR Technologies - ConfidentialWhat is Map-Reduce? Map-reduce programs are defined (mostly) by– A map function that does independent record transformations (anddeletions and replications)– Reduce functions that do aggregation Map-reduce programs run in framework that– Schedules and re-runs tasks– Splits the input– Moves map outputs to reduce inputs– Receives the results6
  7. 7. 7©MapR Technologies - ConfidentialInside Map-Reduce7Input Map CombineShuffleand sortReduce OutputReduce"The time has come," the Walrus said,"To talk of many things:Of shoes—and ships—and sealing-waxthe, 1time, 1has, 1come, 1…come, [3,2,1]has, [1,5,2]the, [1,2,1]time,[10,1,3]…come, 6has, 8the, 4time, 14…
  8. 8. 8©MapR Technologies - ConfidentialNot Just Text Counting words is easy Many other problems work as well– Sessionize user logs– Very large scale joins– Large scale matrix recommendations– Computing the quadrillionth digit of π Map-reduce is inherently batch oriented
  9. 9. 9©MapR Technologies - ConfidentialInside Map-Reduce9Input Map Shuffleand sortReduce Outputroad1, polyline(p1, p2, p3, …)lake1, polygon(p4, p5, p7, p9, …)road2, polyline(p6, p7, p9, …)tile0918-1412, road1tile1082-8143, road1tile0014-3284, lake1tile1082-8143, lake1…tile0918-1412, [road1]tile1082-8143, [road1, lake1]tile0014-3284, [lake1]…tile1082-8143, img#1tile0014-3284, img#2tile1082-8143, img#3…
  10. 10. 10©MapR Technologies - ConfidentialWhat is Storm? A Storm program is called a topology– Spouts inject data into a topology– Bolts process data The units of data are called tuples All processing is flow-through Bolts can buffer or persist Output tuples can be anchored If a tuple and all descendants are not ack’ed in time, bolt has died Bolts that fail are restarted and un-acked tuples are replayed
  11. 11. 11©MapR Technologies - ConfidentialtnowHadoop is Not Very Real-timeUnprocessedDataFullyprocessedLatest fullperiodHadoop jobtakes thislong for thisdata
  12. 12. 12©MapR Technologies - ConfidentialtnowHadoop worksgreat back hereStormworkshereReal-time and Long-time togetherBlendedviewBlendedviewBlendedView
  13. 13. 13©MapR Technologies - ConfidentialOne AlternativeSearchEngineNoSqlde JourConsumerReal-time Long-time?
  14. 14. 14©MapR Technologies - ConfidentialProblems Simply dumping into noSql engine doesn’t quite work Insert rate is limited No load isolation– Big retrospective jobs kill real-time Low scan performance– Hbase pretty good, but not stellar Difficult to set boundaries– where does real-time end and long-time begin?
  15. 15. 15©MapR Technologies - ConfidentialDataSourcesCatcherClusterRough Design – Data FlowCatcherClusterQuery EventSpoutLoggerBoltCounterBoltRawLogsLoggerBoltSemiAggHadoopAggregatorSnapLongaggProtoSpoutCounterBoltLoggerBoltDataSources
  16. 16. 16©MapR Technologies - ConfidentialCloser Look – Catcher ProtocolDataSourcesCatcherClusterCatcherClusterDataSourcesThe data sources and catcherscommunicate with a very simpleprotocol.Hello() => list of catchersLog(topic,message) =>(OK|FAIL, redirect-to-catcher)
  17. 17. 17©MapR Technologies - ConfidentialCloser Look – Catcher QueuesCatcherClusterCatcherClusterThe catchers forward log requeststo the correct catcher and returnthat host in the reply to allow theclient to avoid the extra hop.Each topic file is appended byexactly one catcher.Topic files are kept in shared filestorage.TopicFileTopicFile
  18. 18. 18©MapR Technologies - ConfidentialCloser Look – ProtoSpoutThe ProtoSpout tails the topic files,parses log records into tuples andinjects them into the Stormtopology.Last fully acked position stored inshared file system.TopicFileTopicFileProtoSpout
  19. 19. 19©MapR Technologies - ConfidentialCloser Look – Counter Bolt Critical design goals:– fast ack for all tuples– fast restart of counter Ack happens when tuple hits the replay log (10’s of milliseconds) Restart involves replaying semi-agg’s + replay log (very fast) Replay log only lasts until next semi-aggregate goes outCounterBoltReplayLogSemi-aggregatedrecordsIncomingrecordsReal-time Long-time
  20. 20. 20©MapR Technologies - ConfidentialA Frozen Moment in Time Snapshot defines the dividingline All data in the snap is long-time, all after is real-time Semi-agg strategy allows cleanquerySemiAggHadoopAggregatorSnapLongagg
  21. 21. 21©MapR Technologies - ConfidentialGuarantees Counter output volume is small-ish– the greater of k tuples per 100K inputs or k tuple/s– 1 tuple/s/label/bolt for this exercise Persistence layer must provide guarantees– distributed against node failure– must have either readable flush or closed-append HDFS is distributed, but provides no guarantees and strangesemantics MapRfs is distributed, provides all necessary guarantees
  22. 22. 22©MapR Technologies - ConfidentialPresentation Layer Presentation must– read recent output of Logger bolt– read relevant output of Hadoop jobs– combine semi-aggregated records User will see– counts that increment within 0-2 s of events– seamless and accurate meld of short and long-term data
  23. 23. 23©MapR Technologies - ConfidentialExample 2 – AB testing in real-time I have 15 versions of my landing page Each visitor is assigned to a version– Which version? A conversion or sale or whatever can happen– How long to wait? Some versions of the landing page are horrible– Don’t want to give them traffic
  24. 24. 29©MapR Technologies - ConfidentialBayesian Bandit Compute distributions based on data Sample p1 and p2 from these distributions Put a coin in bandit 1 if p1 > p2 Else, put the coin in bandit 2
  25. 25. 30©MapR Technologies - ConfidentialAnd it works!11000 100 200 300 400 500 600 700 800 900 10000.1200.ε- greedy, ε = 0.05Bayesian Bandit with Gamma- Normal
  26. 26. 31©MapR Technologies - ConfidentialVideo Demo
  27. 27. 32©MapR Technologies - ConfidentialThe Code Select an alternative Select and learn But we already know how to count!n = dim(k)[1]p0 = rep(0, length.out=n)for (i in 1:n) {p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)}return (which(p0 == max(p0)))for (z in 1:steps) {i = select(k)j = test(i)k[i,j] = k[i,j]+1}return (k)
  28. 28. 33©MapR Technologies - ConfidentialThe Basic Idea We can encode a distribution by sampling Sampling allows unification of exploration and exploitation Can be extended to more general response models
  29. 29. 34©MapR Technologies - ConfidentialThe Bigger Basic Idea Online algorithms generally have relatively small state (likecounting) Online algorithms generally have a simple update (like counting) If we can do this with counting, we can do it with all kinds ofalgorithms
  30. 30. 37©MapR Technologies - Confidential Contact:–– @ted_dunning Slides and such (available late tonight):– Send email !! Hash tags: #mapr #bigdata We are hiring!
  31. 31. 38©MapR Technologies - ConfidentialThank You