Real-time and Long-time Together

A talk that Ted Dunning gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.

Published in: Technology, Business

Transcript

  • 1. Real-time and Long-time with Storm and Hadoop (©MapR Technologies - Confidential)
  • 2. Real-time and Long-time with Storm and Hadoop (MapR)
  • 3.
      - Contact:
          - tdunning@maprtech.com
          - @ted_dunning
      - Slides and such (available late tonight):
          - http://info.mapr.com/ted-pjug
      - Hash tags: #mapr #pjug
  • 4. The Challenge
      - Hadoop is great at processing vats of data
          - But sucks for real-time (by design!)
      - Storm is great for real-time processing
          - But lacks any way to deal with batch processing
      - It sounds like there isn’t a solution
          - Neither fashionable solution handles everything
  • 5. This is not a problem. It’s an opportunity!
  • 6. What is Map-Reduce?
      - Map-reduce programs are defined (mostly) by
          - A map function that does independent record transformations (and deletions and replications)
          - Reduce functions that do aggregation
      - Map-reduce programs run in a framework that
          - Schedules and re-runs tasks
          - Splits the input
          - Moves map outputs to reduce inputs
          - Receives the results
  • 7. Inside Map-Reduce
      - Pipeline: Input, Map, Combine, Shuffle and sort, Reduce, Output
      - Input: "The time has come," the Walrus said, "To talk of many things: Of shoes--and ships--and sealing-wax
      - Map output: (the, 1), (time, 1), (has, 1), (come, 1), …
      - After combine, shuffle and sort: (come, [3,2,1]), (has, [1,5,2]), (the, [1,2,1]), (time, [10,1,3]), …
      - Reduce output: (come, 6), (has, 8), (the, 4), (time, 14), …
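The word-count pipeline on this slide can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the function names (`map_phase`, `combine`, `shuffle`, `reduce_phase`, `word_count`) and the way the input is divided into splits are assumptions made for the example.

```python
from collections import Counter, defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        word = word.strip('.,:;"!?')
        if word:
            yield word, 1

def combine(pairs):
    """Combine: pre-sum counts within one split to shrink what gets shuffled."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def shuffle(combined_outputs):
    """Shuffle and sort: group the partial counts by key, in key order."""
    groups = defaultdict(list)
    for pairs in combined_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return sorted(groups.items())

def reduce_phase(word, counts):
    """Reduce: aggregate the partial counts for one word."""
    return word, sum(counts)

def word_count(splits):
    # Each split is mapped and combined independently, as separate map tasks would be.
    combined = [
        combine(pair for line in split for pair in map_phase(line))
        for split in splits
    ]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(combined))

# Two hypothetical input splits taken from the quotation on the slide.
splits = [
    ['"The time has come," the Walrus said,'],
    ['"To talk of many things:'],
]
counts = word_count(splits)
```

In a real cluster each split would be handled by a separate map task and the shuffle would move data between machines; here every stage is an in-memory function call.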
  • 8. Not Just Text
      - Counting words is easy
      - Many other problems work as well
          - Sessionize user logs
          - Very large scale joins
          - Large scale matrix recommendations
          - Computing the quadrillionth digit of π
      - Map-reduce is inherently batch oriented
  • 9. Inside Map-Reduce
      - Pipeline: Input, Map, Shuffle and sort, Reduce, Output
      - Input: road1, polyline(p1, p2, p3, …); lake1, polygon(p4, p5, p7, p9, …); road2, polyline(p6, p7, p9, …)
      - Map output: (tile0918-1412, road1), (tile1082-8143, road1), (tile0014-3284, lake1), (tile1082-8143, lake1), …
      - After shuffle and sort: (tile0918-1412, [road1]), (tile1082-8143, [road1, lake1]), (tile0014-3284, [lake1]), …
      - Reduce output: (tile1082-8143, img#1), (tile0014-3284, img#2), (tile1082-8143, img#3), …
  • 10. What is Storm
      - A Storm program is called a topology
          - Spouts inject data into a topology
          - Bolts process data
      - The units of data are called tuples
      - All processing is flow-through
      - Bolts can buffer or persist
      - Output tuples can be anchored
      - Bolts that fail are restarted and un-acked tuples are replayed
  • 11. Hadoop is Not Very Real-time
      - Diagram labels (time axis t, marked "now"): Unprocessed Data; Fully processed; Latest full period
      - Hadoop job takes this long for this data
  • 12. Real-time and Long-time together
      - Diagram labels (time axis t, marked "now"): Hadoop works great back here; Storm works here
      - Blended view
  • 13. One Alternative
      - Diagram labels: Search Engine, NoSql du Jour, Consumer, Real-time, Long-time, ?
  • 14. Problems
      - Simply dumping into a NoSQL engine doesn’t quite work
      - Insert rate is limited
      - No load isolation
          - Big retrospective jobs kill real-time
      - Low scan performance
          - HBase pretty good, but not stellar
      - Difficult to set boundaries
          - Where does real-time end and long-time begin?
  • 15. Rough Design: Data Flow
      - Diagram components: Search Engine, Query, Event Spout, Logger Bolt, Counter Bolt, Raw Logs, Semi Agg, Hadoop Aggregator, Snap, Long agg
  • 16. Closer Look
      - Critical design goals:
          - fast ack for all tuples
          - fast restart of counter
      - Ack happens when tuple hits the replay log (hundreds of milliseconds)
      - Restart involves replaying semi-aggs + replay log (very fast)
      - Diagram components: Incoming records, Counter Bolt, Replay Log, Semi-aggregated records, Real-time, Long-time
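The two design goals on this slide (ack once the tuple is in the replay log, restart from semi-aggregates plus the replay log) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not Storm code: the `CounterBolt` class, its method names, and the use of Python lists in place of durable files are all hypothetical.

```python
from collections import Counter

class CounterBolt:
    """Toy counter: ack once a tuple is in the replay log, flush semi-aggregates later."""

    def __init__(self, replay_log, semi_aggs):
        self.replay_log = replay_log  # stands in for a durable, append-only file
        self.semi_aggs = semi_aggs    # stands in for durable semi-aggregated records
        self.pending = Counter()      # counts not yet written as a semi-aggregate
        # Restart path: whatever is still in the log was acked but never flushed.
        for label in self.replay_log:
            self.pending[label] += 1

    def execute(self, label):
        self.replay_log.append(label)  # the durable append happens first ...
        self.pending[label] += 1
        return "ack"                   # ... so the ack can be sent right away

    def flush(self):
        """Write pending counts as one semi-aggregated record, then truncate the log."""
        if self.pending:
            self.semi_aggs.append(dict(self.pending))
            self.pending.clear()
            del self.replay_log[:]

    def totals(self):
        total = Counter(self.pending)
        for record in self.semi_aggs:
            total.update(record)
        return total

log, aggs = [], []
bolt = CounterBolt(log, aggs)
for label in ["a", "b", "a"]:
    bolt.execute(label)
bolt.flush()
bolt.execute("a")

# Simulated crash: the in-memory bolt is lost, the log and semi-aggregates survive.
restarted = CounterBolt(log, aggs)
```

After the simulated crash, `restarted.totals()` matches what the original bolt had counted, because the unflushed tuple is recovered from the replay log. A real implementation would also need the flush and the log truncation to be atomic with respect to failures, which this sketch ignores.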
  • 17. A Frozen Moment in Time
      - Snapshot defines the dividing line
      - All data in the snap is long-time, all after is real-time
      - Semi-agg strategy allows clean query
      - Diagram components: Semi Agg, Hadoop Aggregator, Snap, Long agg
  • 18. Guarantees
      - Counter output volume is small-ish
          - the greater of k tuples per 100K inputs or k tuple/s
          - 1 tuple/s/label/bolt for this exercise
      - Persistence layer must provide guarantees
          - distributed against node failure
          - must have either readable flush or closed-append
      - HDFS is distributed, but provides no guarantees and strange semantics
      - MapRfs is distributed, provides all necessary guarantees
  • 19. Presentation Layer
      - Presentation must
          - read recent output of Logger bolt
          - read relevant output of Hadoop jobs
          - combine semi-aggregated records
      - User will see
          - counts that increment within 0-2 s of events
          - seamless meld of short and long-term data
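One way to picture the presentation layer's job is a function that starts from the long-time aggregate produced by Hadoop and adds only the semi-aggregated records that fall after the snapshot. The record layout (`time` and `counts` fields), the function name `blended_counts`, and the integer timestamps are assumptions for illustration; the slides do not specify a schema.

```python
from collections import Counter

def blended_counts(long_agg, semi_aggs, snapshot_time):
    """Combine the long-time aggregate with semi-aggregates newer than the snapshot."""
    total = Counter(long_agg)
    for record in semi_aggs:
        if record["time"] > snapshot_time:  # real-time side of the dividing line
            total.update(record["counts"])
    return total

# Hypothetical data: the long aggregate covers everything up to time 100.
long_agg = {"page_a": 1000, "page_b": 250}
semi_aggs = [
    {"time": 95, "counts": {"page_a": 4}},  # already inside the snapshot, so skipped
    {"time": 101, "counts": {"page_a": 3, "page_b": 1}},
    {"time": 102, "counts": {"page_c": 2}},
]
view = blended_counts(long_agg, semi_aggs, snapshot_time=100)
```

Because the snapshot is the single dividing line, no event is counted twice: records at or before the snapshot are represented only through the long-time aggregate.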
  • 20. Example 2: AB testing in real-time
      - I have 15 versions of my landing page
      - Each visitor is assigned to a version
          - Which version?
      - A conversion or sale or whatever can happen
          - How long to wait?
      - Some versions of the landing page are horrible
          - Don’t want to give them traffic
  • 21. A Quick Diversion
      - You see a coin
          - What is the probability of heads?
          - Could it be larger or smaller than that?
      - I flip the coin and while it is in the air ask again
      - I catch the coin and ask again
      - I look at the coin (and you don’t) and ask again
      - Why does the answer change?
          - And did it ever have a single value?
  • 22. A Philosophical Conclusion
      - Probability as expressed by humans is subjective and depends on information and experience
  • 23. I Dunno
  • 24. 5 heads out of 10 throws
  • 25. 2 heads out of 12 throws
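The transcript keeps only the titles of these slides. As a hedged illustration of how belief about the coin could be summarised after each set of throws, the sketch below assumes a uniform prior ("I dunno") and a Beta posterior, which is consistent with the `rbeta(..., k+1, k+1)` call in the deck's own code. The function name `beta_posterior` is an assumption.

```python
from math import sqrt

def beta_posterior(heads, throws):
    """Posterior Beta(a, b) for the heads probability, starting from a uniform prior."""
    a = heads + 1
    b = throws - heads + 1
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, sd

five_of_ten = beta_posterior(5, 10)    # Beta(6, 6), mean 0.5
two_of_twelve = beta_posterior(2, 12)  # Beta(3, 11), mean 3/14
```

The second posterior is centred lower and is somewhat narrower than the first. More throws narrow it further, yet neither case collapses to a single value.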
  • 26. Bayesian Bandit
      - Compute distributions based on data
      - Sample p1 and p2 from these distributions
      - Put a coin in bandit 1 if p1 > p2
      - Else, put the coin in bandit 2
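The four steps on this slide can be sketched as a short simulation. It assumes win/loss payoffs with a Beta posterior per bandit, starting from a uniform prior. The payoff rates, the step count, the seed and the function names `select` and `run` are made up for the example.

```python
import random

def select(counts, rng):
    """Sample a payoff rate for each bandit from its Beta posterior; play the largest."""
    samples = [rng.betavariate(wins + 1, losses + 1) for wins, losses in counts]
    return samples.index(max(samples))

def run(true_rates, steps, seed=0):
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_rates]  # [wins, losses] per bandit
    for _ in range(steps):
        i = select(counts, rng)
        won = rng.random() < true_rates[i]  # simulated payoff, unknown to the player
        counts[i][0 if won else 1] += 1
    return counts

# Hypothetical bandits: the second one pays off three times as often.
counts = run(true_rates=[0.10, 0.30], steps=2000)
plays = [wins + losses for wins, losses in counts]
```

Early on both posteriors are wide, so either bandit can produce the larger sample and both get tried. As counts accumulate, the better bandit wins the comparison almost every time, so most of the 2000 plays go to it without any explicit exploration schedule.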
  • 27. And it works!
      - Plot: regret (vertical axis, 0 to 0.12) against n (horizontal axis, 0 to 1100)
      - Curves: ε-greedy with ε = 0.05; Bayesian Bandit with Gamma-Normal
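  • 28. Video Demo
  • 29. The Code
      - Select an alternative

            # k holds outcome counts, one row per alternative.
            # The function wrapper is reconstructed from the call select(k) below.
            select = function(k) {
              n = dim(k)[1]
              p0 = rep(0, length.out=n)
              for (i in 1:n) {
                # sample a plausible rate for alternative i from its Beta posterior
                p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
              }
              return (which(p0 == max(p0)))
            }

      - Select and learn

            # The wrapper name and arguments are assumed; test(i) is not shown
            # on the slide and returns the outcome column j for alternative i.
            learn = function(k, steps) {
              for (z in 1:steps) {
                i = select(k)
                j = test(i)
                k[i,j] = k[i,j]+1
              }
              return (k)
            }

      - But we already know how to count!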
  • 30. The Basic Idea
      - We can encode a distribution by sampling
      - Sampling allows unification of exploration and exploitation
      - Can be extended to more general response models
  • 31.
      - Contact:
          - tdunning@maprtech.com
          - @ted_dunning
      - Slides and such (available late tonight):
          - http://info.mapr.com/ted-pjug
      - Hash tags: #mapr #pjug
  • 32. Thank You
