- 1. Real-time Learning for Fun and Profit (©MapR Technologies - Confidential)
- 2. Contact: tdunning@maprtech.com, @ted_dunning. Slides and such (available late tonight): http://slideshare.net/tdunning. Hash tags: #mapr #storm #bbuzz
- 3. The Challenge. Hadoop is great at processing vats of data, but sucks for real-time (by design!). Storm is great for real-time processing, but lacks any way to deal with batch processing. It sounds like there isn't a solution: neither fashionable solution handles everything.
- 4. This is not a problem. It's an opportunity!
- 5. Hadoop is Not Very Real-time. [Diagram: timeline ending at t_now, with regions labeled "Fully processed", "Latest full period" and "Unprocessed data"; annotation: "Hadoop job takes this long for this data"]
- 6. Real-time and Long-time together. [Diagram: timeline ending at t_now, annotated "Hadoop works great back here" and "Storm works here", with a "Blended view" spanning both]
- 7. One Alternative. [Diagram labels: Search Engine, NoSql de Jour, Consumer, Real-time, Long-time, "?"]
- 8. Problems. Simply dumping into a NoSQL engine doesn't quite work. Insert rate is limited. No load isolation: big retrospective jobs kill real-time. Low scan performance: HBase is pretty good, but not stellar. Difficult to set boundaries: where does real-time end and long-time begin?
- 9. Almost a Solution. Lambda architecture talks about a function of long-time state: a real-time approximate accelerator adjusts the previous result to the current state. Sounds good, but: How does the real-time accelerator combine with long-time? What algorithms can do this? How can we avoid gaps and overlaps and other errors? Needs more work.
- 10. A Simple Example. Let's start with the simplest case: counting. Counting = addition. Addition is associative. Addition is on-line. We can generalize these results to all associative, on-line functions, but let's start simple.
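
A minimal Python sketch of why associativity matters for counting. The function name `merge_counts` and the sample batches are illustrative assumptions, not from the talk. Partial counts can be combined in any grouping and still give the same total, which is what lets old and recent partial results be added together later.

```python
from collections import Counter
from functools import reduce

def merge_counts(a, b):
    """Combine two partial count tables; Counter addition is associative."""
    return a + b

# Three hypothetical batches of per-label counts (say, two older and one recent).
batches = [Counter(x=3, y=1), Counter(x=2), Counter(y=4, z=1)]

# Group the merges two different ways, and also fold left to right.
left_first = merge_counts(merge_counts(batches[0], batches[1]), batches[2])
right_first = merge_counts(batches[0], merge_counts(batches[1], batches[2]))
total = reduce(merge_counts, batches, Counter())
```

All three results are the same table, regardless of how the merges were grouped.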
- 11. Rough Design – Data Flow. [Diagram labels: Data Sources, Catcher Cluster, Query Event Spout, ProtoSpout, Logger Bolt, Counter Bolt, Raw Logs, Semi Agg, Hadoop Aggregator, Snap, Long agg]
- 12. Closer Look – Catcher Protocol. The data sources and catchers communicate with a very simple protocol. Hello() => list of catchers. Log(topic, message) => (OK|FAIL, redirect-to-catcher). [Diagram labels: Data Sources, Catcher Cluster]
- 13. Closer Look – Catcher Queues. The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop. Each topic file is appended by exactly one catcher. Topic files are kept in shared file storage. [Diagram labels: Catcher Cluster, Topic File]
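
The catcher protocol described above can be sketched roughly as follows, in Python. Everything beyond the `Hello`/`Log` shapes is an assumption made for illustration: the class and function names, the in-memory dict standing in for topic files in shared storage, and especially the hash-based choice of which catcher owns a topic (the talk does not say how ownership is decided).

```python
import zlib

class Catcher:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster     # shared dict: catcher name -> Catcher
        self.topic_files = {}      # topic -> messages (stand-in for shared files)
        self.forwarded = 0         # how many requests this catcher had to forward
        cluster[name] = self

    def hello(self):
        """Hello() => list of catchers."""
        return sorted(self.cluster)

    def _owner(self, topic):
        # Assumption: a stable hash picks the single catcher that appends a topic.
        names = sorted(self.cluster)
        return self.cluster[names[zlib.crc32(topic.encode()) % len(names)]]

    def log(self, topic, message):
        """Log(topic, message) => (status, redirect-to-catcher)."""
        owner = self._owner(topic)
        if owner is not self:
            self.forwarded += 1    # extra hop: forward to the owning catcher
        owner.topic_files.setdefault(topic, []).append(message)
        return "OK", owner.name

cluster = {}
a, b, c = (Catcher(n, cluster) for n in ("catcher-a", "catcher-b", "catcher-c"))

route = {}  # client-side cache: topic -> catcher name learned from replies

def send(topic, message, default=a):
    """Client: use the cached owner if known, else any catcher, then remember the redirect."""
    target = cluster.get(route.get(topic), default)
    status, owner = target.log(topic, message)
    route[topic] = owner
    return status

for n in range(5):
    send("clicks", f"event-{n}")
```

After the first reply the client talks to the owning catcher directly, so at most one request per topic pays for the extra hop.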
- 14. Closer Look – ProtoSpout. The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology. Last fully acked position is stored in a shared, transactionally correct file system. [Diagram labels: Topic File, ProtoSpout]
- 15. Closer Look – Counter Bolt. Critical design goals: fast ack for all tuples; fast restart of counter. Ack happens when a tuple hits the replay log (10's of milliseconds, group commit). Restart involves replaying semi-aggs + replay log (very fast). Replay log only lasts until the next semi-aggregate goes out. [Diagram labels: Incoming records, Counter Bolt, Replay Log, Semi-aggregated records, Real-time, Long-time]
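
A rough Python sketch of the counter bolt's ack and restart behaviour as described on this slide. It is not Storm code: the class name, the method names and the plain lists standing in for the durable replay log and the emitted semi-aggregates are all assumptions for illustration.

```python
from collections import Counter

class CounterBolt:
    def __init__(self, replay_log, semi_aggs):
        self.replay_log = replay_log        # stand-in for the durable replay log
        self.semi_aggs = semi_aggs          # stand-in for emitted semi-aggregates
        # Restart path: rebuild the in-memory counts from the replay log.
        self.counts = Counter(replay_log)

    def on_tuple(self, label):
        self.replay_log.append(label)       # record first, so the ack is safe
        self.counts[label] += 1
        return "ack"

    def emit_semi_agg(self):
        if self.counts:
            self.semi_aggs.append(dict(self.counts))
        self.counts.clear()
        self.replay_log.clear()             # log only lives until the semi-agg is out

    def totals(self):
        """Counts so far = all semi-aggregates + whatever is still in memory."""
        total = Counter()
        for agg in self.semi_aggs:
            total.update(agg)
        total.update(self.counts)
        return total

log, aggs = [], []
bolt = CounterBolt(log, aggs)
for label in ["a", "b", "a"]:
    bolt.on_tuple(label)
bolt.emit_semi_agg()
bolt.on_tuple("a")

# Simulated crash: a new bolt recovers from the same log and semi-aggregates.
restarted = CounterBolt(log, aggs)
```

The restarted bolt sees the same totals as the one that crashed, because every acked tuple is either in a semi-aggregate or still in the replay log.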
- 16. A Frozen Moment in Time. Snapshot defines the dividing line. All data in the snap is long-time, all after is real-time. Semi-agg strategy allows clean combination of both kinds of data. Data synchronized snap not needed. [Diagram labels: Semi Agg, Hadoop Aggregator, Snap, Long agg]
- 17. Guarantees. Counter output volume is small-ish: the greater of k tuples per 100K inputs or k tuple/s; 1 tuple/s/label/bolt for this exercise. Persistence layer must provide guarantees: distributed against node failure; must have either readable flush or closed-append. HDFS is distributed, but provides no guarantees and strange semantics. MapRfs is distributed, provides all necessary guarantees.
- 18. Presentation Layer. Presentation must: read recent output of Logger bolt; read relevant output of Hadoop jobs; combine semi-aggregated records. User will see: counts that increment within 0-2 s of events; seamless and accurate meld of short and long-term data.
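
One way the presentation layer's combination step might look, sketched in Python. The function name, the `(emitted_at, counts)` record shape and the use of a single timestamp as the snapshot boundary are assumptions; the talk only says that the snapshot is the dividing line and that semi-aggregated records are combined with the Hadoop output.

```python
from collections import Counter

def blended_counts(long_agg, snapshot_time, semi_aggs):
    """Long-time aggregate plus every semi-aggregate emitted after the snapshot."""
    total = Counter(long_agg)
    for emitted_at, counts in semi_aggs:
        if emitted_at > snapshot_time:      # already in the snapshot otherwise
            total.update(counts)
    return total

# Hypothetical data: the Hadoop job aggregated everything up to time 100.
long_agg = {"clicks": 1000, "views": 5000}
semi_aggs = [
    (95, {"clicks": 7}),                    # before the snapshot: skipped
    (101, {"clicks": 3, "views": 10}),
    (102, {"clicks": 1}),
]
blended = blended_counts(long_agg, 100, semi_aggs)
```

Because counting is associative, skipping the records already covered by the snapshot and adding the rest avoids both gaps and double counting.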
- 19. The Basic Idea. Online algorithms generally have relatively small state (like counting). Online algorithms generally have a simple update (like counting). If we can do this with counting, we can do it with all kinds of algorithms.
- 20. Summary – Part 1. Semi-agg strategy + snapshots allows correct real-time counts, because addition is on-line and associative. Other on-line associative operations include: k-means clustering (see Dan Filimon's talk at 16.); count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse); top-k values; top-k (count(*)) (see streamlib); contextual Bayesian bandits (see part 2 of this talk).
- 21. Example 2 – AB testing in real-time. I have 15 versions of my landing page. Each visitor is assigned to a version: which version? A conversion or sale or whatever can happen: how long to wait? Some versions of the landing page are horrible: don't want to give them traffic.
- 22. A Quick Diversion. You see a coin: What is the probability of heads? Could it be larger or smaller than that? I flip the coin and while it is in the air ask again. I catch the coin and ask again. I look at the coin (and you don't) and ask again. Why does the answer change? And did it ever have a single value?
- 23. A Philosophical Conclusion. Probability as expressed by humans is subjective and depends on information and experience.
- 24. I Dunno
- 25. 5 heads out of 10 throws
- 26. 2 heads out of 12 throws. Using any single number as a "best" estimate denies the uncertain nature of a distribution. Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails. [Plot label: Mean]
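
One way to read slides 24 to 26, assuming a uniform Beta(1, 1) prior on the heads probability p. The prior is an assumption here, chosen because it matches the +1 terms in the `rbeta` call in the talk's code slide. After h heads in n throws, the posterior and its mean are:

```latex
p \mid h, n \;\sim\; \mathrm{Beta}(h + 1,\; n - h + 1),
\qquad
\mathbb{E}[\,p \mid h, n\,] = \frac{h + 1}{n + 2}
```

Under that assumption, no data gives Beta(1, 1), 5 heads out of 10 throws gives Beta(6, 6) with mean 0.5, and 2 heads out of 12 throws gives Beta(3, 11) with mean 3/14 ≈ 0.214. The mean is only one summary; the slides' point is that the whole distribution carries the information.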
- 27. Bayesian Bandit. Compute distributions based on data. Sample p1 and p2 from these distributions. Put a coin in bandit 1 if p1 > p2. Else, put the coin in bandit 2.
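
The four steps on this slide can be sketched in Python for two bandits. This is an illustrative sketch: the Beta posterior with a uniform prior, the simulated payout rates and the names `select` and `run` are assumptions rather than the talk's own code.

```python
import random

def select(k, rng):
    """k[i] = [failures, successes] for bandit i.

    Sample one plausible payout rate per bandit from its Beta posterior
    and play the bandit whose sample is largest.
    """
    samples = [rng.betavariate(wins + 1, losses + 1) for losses, wins in k]
    return samples.index(max(samples))

def run(true_rates, steps, seed=0):
    """Simulate `steps` plays against bandits with the given hidden payout rates."""
    rng = random.Random(seed)
    k = [[0, 0] for _ in true_rates]
    for _ in range(steps):
        i = select(k, rng)
        j = 1 if rng.random() < true_rates[i] else 0   # 1 = payout, 0 = no payout
        k[i][j] += 1                                   # learning is just counting
    return k

# Hypothetical payout rates: bandit 2 is clearly better than bandit 1.
k = run([0.05, 0.20], steps=2000)
pulls = [sum(row) for row in k]
```

Early on both bandits get played because their posteriors are wide; as counts accumulate, the samples for the worse bandit rarely win, so most coins go to the better one.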
- 28. And it works! [Plot: regret (0 to 0.12) versus n (0 to 1000) for two strategies: ε-greedy with ε = 0.05, and Bayesian Bandit with Gamma-Normal]
- 29. Video Demo
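- 30. The Code. Select an alternative:

          # k is an n x 2 matrix of counts: column 1 = failures, column 2 = successes.
          select = function(k) {
            n = dim(k)[1]
            p0 = rep(0, length.out=n)
            for (i in 1:n) {
              # One sample from each alternative's Beta posterior
              p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
            }
            return (which(p0 == max(p0)))
          }

  Select and learn:

          # The function name `learn` is a placeholder; the slide shows only the body.
          # test(i) is not defined on the slide; it returns the outcome column (1 or 2).
          learn = function(k, steps) {
            for (z in 1:steps) {
              i = select(k)
              j = test(i)
              k[i,j] = k[i,j]+1
            }
            return (k)
          }

  But we already know how to count!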
- 31. The Basic Idea. We can encode a distribution by sampling. Sampling allows unification of exploration and exploitation. Can be extended to more general response models. Note that learning here = counting = on-line algorithm.
- 32. Caveats. Original Bayesian Bandit only requires real-time. Generalized Bandit may require access to long history for learning: pseudo online learning may be easier than true online. Bandit variables can include content, time of day, day of week. Context variables can include user id, user features. Bandit × context variables provide the real power.
- 33. Contact: tdunning@maprtech.com, @ted_dunning. Slides and such (available late tonight): http://slideshare.net/tdunning. Hash tags: #mapr #storm #bbuzz
- 34. Thank You
