
Real-time and Long-time Together


A talk that Ted Dunning gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.

Published in: Technology, Business

Transcript of "Real-time and Long-time Together"

  1. Real-time and Long-time with Storm and Hadoop (©MapR Technologies - Confidential)
  2. Real-time and Long-time with Storm and Hadoop (MapR)
  3. Contact:
        - tdunning@maprtech.com
        - @ted_dunning
     Slides and such (available late tonight):
        - http://info.mapr.com/ted-pjug
     Hash tags: #mapr #pjug
  4. The Challenge
     • Hadoop is great at processing vats of data
        - But sucks for real-time (by design!)
     • Storm is great for real-time processing
        - But lacks any way to deal with batch processing
     • It sounds like there isn’t a solution
        - Neither fashionable solution handles everything
  5. This is not a problem. It’s an opportunity!
  6. What is Map-Reduce?
     • Map-reduce programs are defined (mostly) by
        - A map function that does independent record transformations (and deletions and replications)
        - Reduce functions that do aggregation
     • Map-reduce programs run in a framework that
        - Schedules and re-runs tasks
        - Splits the input
        - Moves map outputs to reduce inputs
        - Receives the results
  7. Inside Map-Reduce
     Pipeline: Input → Map → Combine → Shuffle and sort → Reduce → Output
     • Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
     • Map output: the, 1; time, 1; has, 1; come, 1; …
     • After combine, shuffle and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …
     • Reduce output: come, 6; has, 8; the, 4; time, 14; …
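The word-count pipeline on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, combine, shuffle and reduce stages, not Hadoop code; the function names and the two-split input are assumptions made for the example.

```python
from collections import Counter, defaultdict

def map_words(line):
    """Map: emit (word, 1) for every word in one input record."""
    for raw in line.lower().split():
        word = raw.strip('.,:;"\'!?')
        if word:
            yield word, 1

def combine(pairs):
    """Combine: pre-sum counts locally, giving one partial count per word per split."""
    partial = Counter()
    for word, n in pairs:
        partial[word] += n
    return partial.items()

def shuffle(partials):
    """Shuffle and sort: group every partial count under its word."""
    groups = defaultdict(list)
    for pairs in partials:
        for word, n in pairs:
            groups[word].append(n)
    return dict(sorted(groups.items()))

def reduce_counts(groups):
    """Reduce: sum the partial counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(splits):
    partials = []
    for split in splits:
        pairs = (pair for line in split for pair in map_words(line))
        partials.append(list(combine(pairs)))
    return reduce_counts(shuffle(partials))

# Two hypothetical input splits, each handled by its own mapper and combiner.
splits = [
    ['"The time has come," the Walrus said,'],
    ['"To talk of many things:'],
]
counts = word_count(splits)
```

The combiner is optional: it only shrinks the data that has to cross the network during the shuffle, which is why each word on the slide arrives at the reducer with a short list of partial counts rather than a long list of ones.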
  8. Not Just Text
     • Counting words is easy
     • Many other problems work as well
        - Sessionize user logs
        - Very large scale joins
        - Large scale matrix recommendations
        - Computing the quadrillionth digit of π
     • Map-reduce is inherently batch oriented
  9. Inside Map-Reduce
     Pipeline: Input → Map → Shuffle and sort → Reduce → Output
     • Input: road1, polyline(p1, p2, p3, …); lake1, polygon(p4, p5, p7, p9, …); road2, polyline(p6, p7, p9, …)
     • Map output: tile0918-1412, road1; tile1082-8143, road1; tile0014-3284, lake1; tile1082-8143, lake1; …
     • After shuffle and sort: tile0918-1412, [road1]; tile1082-8143, [road1, lake1]; tile0014-3284, [lake1]; …
     • Reduce output: tile1082-8143, img#1; tile0014-3284, img#2; tile1082-8143, img#3; …
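The map-tile example follows the same pattern with a different key. The sketch below assumes the geometry step is already done: `TILES_TOUCHED` is a hypothetical lookup standing in for the real computation of which tiles a polyline or polygon overlaps, and `reduce_tile` returns a description where a real job would render an image.

```python
from collections import defaultdict

# Hypothetical stand-in for real geometry: the tiles each feature overlaps.
TILES_TOUCHED = {
    "road1": ["tile0918-1412", "tile1082-8143"],
    "lake1": ["tile0014-3284", "tile1082-8143"],
}

def map_feature(name):
    """Map: emit (tile, feature) for every tile the feature overlaps."""
    for tile in TILES_TOUCHED[name]:
        yield tile, name

def shuffle(pairs):
    """Shuffle and sort: collect all features that land on the same tile."""
    groups = defaultdict(list)
    for tile, name in pairs:
        groups[tile].append(name)
    return dict(sorted(groups.items()))

def reduce_tile(tile, features):
    """Reduce: a real job would draw the tile; here we only describe it."""
    return f"render({tile}: {', '.join(features)})"

pairs = [pair for name in TILES_TOUCHED for pair in map_feature(name)]
groups = shuffle(pairs)
rendered = {tile: reduce_tile(tile, feats) for tile, feats in groups.items()}
```

The point of the example is that the map step replicates a record (one feature, several tiles), so the reducer for a tile sees everything it needs to draw that tile and nothing else.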
  10. What is Storm
      • A Storm program is called a topology
         - Spouts inject data into a topology
         - Bolts process data
      • The units of data are called tuples
      • All processing is flow-through
      • Bolts can buffer or persist
      • Output tuples can be anchored
      • Bolts that fail are restarted and un-acked tuples are replayed
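The ack-and-replay behaviour in the last bullet can be illustrated with a toy model. This is plain Python and deliberately does not use or imitate Storm's API; `ToySpout`, `FlakyBolt` and `run` are hypothetical names, and the whole "topology" runs in a single loop.

```python
class ToySpout:
    """Emits tuples and remembers each one until it is acked."""

    def __init__(self, records):
        self.queue = list(enumerate(records))  # (tuple id, value)
        self.pending = {}

    def next_tuple(self):
        if not self.queue:
            return None
        tuple_id, value = self.queue.pop(0)
        self.pending[tuple_id] = value
        return tuple_id, value

    def ack(self, tuple_id):
        self.pending.pop(tuple_id, None)

    def replay_pending(self):
        """Put every un-acked tuple back at the front of the queue, oldest first."""
        self.queue = sorted(self.pending.items()) + self.queue
        self.pending.clear()


class FlakyBolt:
    """Uppercases values; fails once on a chosen value to simulate a crash."""

    def __init__(self, fail_on):
        self.fail_on = fail_on
        self.output = []

    def execute(self, value):
        if value == self.fail_on:
            self.fail_on = None  # the restarted bolt succeeds next time
            raise RuntimeError("bolt crashed")
        self.output.append(value.upper())


def run(spout, bolt):
    """Drive tuples from the spout through the bolt until the spout is empty."""
    while True:
        item = spout.next_tuple()
        if item is None:
            return bolt.output
        tuple_id, value = item
        try:
            bolt.execute(value)
            spout.ack(tuple_id)
        except RuntimeError:
            spout.replay_pending()  # un-acked tuples are replayed


spout = ToySpout(["a", "b", "c"])
result = run(spout, FlakyBolt(fail_on="b"))
```

Nothing is lost when the bolt fails on "b", because the spout keeps the tuple until it is acked. The toy keeps the bolt's earlier output across the failure; a real bolt that holds state has to be able to rebuild it after a restart.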
  11. Hadoop is Not Very Real-time
      [Timeline diagram, time axis t ending at "now". Labels: Unprocessed Data; Fully processed; Latest full period; Hadoop job takes this long for this data]
  12. Real-time and Long-time together
      [Timeline diagram, time axis t ending at "now". Labels: Hadoop works great back here; Storm works here; Blended view (three times)]
  13. One Alternative
      [Diagram. Labels: Search Engine; NoSql du Jour; Consumer; Real-time; Long-time; ?]
  14. Problems
      • Simply dumping into a NoSQL engine doesn’t quite work
      • Insert rate is limited
      • No load isolation
         - Big retrospective jobs kill real-time
      • Low scan performance
         - HBase is pretty good, but not stellar
      • Difficult to set boundaries
         - Where does real-time end and long-time begin?
  15. Rough Design – Data Flow
      [Data-flow diagram. Labels: Search Engine; Query Event Spout (twice); Logger Bolt (three times); Counter Bolt (twice); Raw Logs; Semi Agg; Hadoop Aggregator; Snap; Long agg]
  16. Closer Look
      • Critical design goals:
         - fast ack for all tuples
         - fast restart of counter
      • Ack happens when tuple hits the replay log (100s of milliseconds)
      • Restart involves replaying semi-aggs + replay log (very fast)
      [Diagram. Labels: Counter Bolt; Replay Log; Semi-aggregated records; Incoming records; Real-time; Long-time]
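One way to read this slide is sketched below: the counter appends each record to a replay log before acking, periodically flushes a semi-aggregated record, and on restart rebuilds its state from the semi-aggregates plus whatever is left in the replay log. This is a guess at the mechanics under simplifying assumptions: Python lists stand in for durable files, and the class and method names are hypothetical rather than taken from the talk.

```python
from collections import Counter

class CounterBolt:
    def __init__(self, replay_log, semi_aggs):
        self.replay_log = replay_log  # stands in for a durable append-only file
        self.semi_aggs = semi_aggs    # stands in for durable flushed partial counts
        self.totals = Counter()
        self.unflushed = Counter()
        self.restore()

    def restore(self):
        """Restart: replay the semi-aggregates, then the replay log."""
        for partial in self.semi_aggs:
            self.totals.update(partial)
        for label in self.replay_log:
            self.unflushed[label] += 1
            self.totals[label] += 1

    def execute(self, label):
        """Log first, then count. Returning True plays the role of the ack."""
        self.replay_log.append(label)
        self.unflushed[label] += 1
        self.totals[label] += 1
        return True

    def flush(self):
        """Emit one semi-aggregated record and start a fresh replay log."""
        if self.unflushed:
            self.semi_aggs.append(dict(self.unflushed))
            self.unflushed.clear()
            self.replay_log.clear()

replay_log, semi_aggs = [], []
bolt = CounterBolt(replay_log, semi_aggs)
for label in ["x", "y", "x"]:
    bolt.execute(label)
bolt.flush()
bolt.execute("y")

# Simulate a crash: a new bolt sees only the durable log and semi-aggregates.
restarted = CounterBolt(replay_log, semi_aggs)
```

Because the ack depends only on the log append, it stays fast, and because each flush truncates the log, a restart only has to replay a short tail of records on top of the compact semi-aggregates.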
  17. A Frozen Moment in Time
      • Snapshot defines the dividing line
      • All data in the snap is long-time, all after is real-time
      • Semi-agg strategy allows clean query
      [Diagram. Labels: Semi Agg; Hadoop Aggregator; Snap; Long agg]
  18. Guarantees
      • Counter output volume is small-ish
         - the greater of k tuples per 100K inputs or k tuple/s
         - 1 tuple/s/label/bolt for this exercise
      • Persistence layer must provide guarantees
         - distributed against node failure
         - must have either readable flush or closed-append
      • HDFS is distributed, but provides no guarantees and strange semantics
      • MapRfs is distributed, provides all necessary guarantees
  19. Presentation Layer
      • Presentation must
         - read recent output of Logger bolt
         - read relevant output of Hadoop jobs
         - combine semi-aggregated records
      • User will see
         - counts that increment within 0-2 s of events
         - seamless meld of short and long-term data
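The blending step might look roughly like the sketch below: start from the long-time aggregate as of the snapshot, then add every semi-aggregated record newer than the snapshot. The record layout (`time`, `label`, `count`) and the function name are assumptions for illustration; the talk does not specify a format.

```python
from collections import Counter

def blended_counts(long_agg, semi_aggs, snapshot_time):
    """Combine the long-time aggregate with semi-aggregates newer than the snapshot."""
    total = Counter(long_agg)
    for record in semi_aggs:
        # Records at or before the snapshot are already inside long_agg.
        if record["time"] > snapshot_time:
            total[record["label"]] += record["count"]
    return total

long_agg = {"cat": 1000, "dog": 400}  # hypothetical Hadoop output as of the snapshot
semi_aggs = [
    {"time": 95, "label": "cat", "count": 3},   # already covered by the snapshot
    {"time": 101, "label": "cat", "count": 2},
    {"time": 102, "label": "bird", "count": 1},
]
view = blended_counts(long_agg, semi_aggs, snapshot_time=100)
```

The snapshot time is what keeps the query clean: each record is counted on exactly one side of the dividing line, so nothing is double-counted or dropped.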
  20. Example 2 – AB testing in real-time
      • I have 15 versions of my landing page
      • Each visitor is assigned to a version
         - Which version?
      • A conversion or sale or whatever can happen
         - How long to wait?
      • Some versions of the landing page are horrible
         - Don’t want to give them traffic
  21. A Quick Diversion
      • You see a coin
         - What is the probability of heads?
         - Could it be larger or smaller than that?
      • I flip the coin and while it is in the air ask again
      • I catch the coin and ask again
      • I look at the coin (and you don’t) and ask again
      • Why does the answer change?
         - And did it ever have a single value?
  22. A Philosophical Conclusion
      • Probability as expressed by humans is subjective and depends on information and experience
  23. I Dunno
  24. 5 heads out of 10 throws
  25. 2 heads out of 12 throws
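The figures for these three slides did not survive, but they presumably show how belief about the coin tightens as throws accumulate. A worked version under one assumption, a uniform prior on the heads probability p (which matches the "+1" on both counts in the `rbeta` call on slide 29):

```latex
% Assumption: uniform prior, p ~ Beta(1, 1), i.e. "I dunno".
% After h heads in n throws the posterior is
p \mid h, n \;\sim\; \mathrm{Beta}(h + 1,\; n - h + 1),
\qquad
\mathbb{E}[p \mid h, n] = \frac{h + 1}{n + 2}.

% 5 heads out of 10 throws:
p \sim \mathrm{Beta}(6, 6), \qquad \mathbb{E}[p] = \tfrac{6}{12} = 0.5

% 2 heads out of 12 throws:
p \sim \mathrm{Beta}(3, 11), \qquad \mathbb{E}[p] = \tfrac{3}{14} \approx 0.214
```

With no data every value of p is equally plausible; ten or twelve throws already concentrate the distribution, but leave it wide enough that sampling from it still produces noticeably different values each time.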
  26. Bayesian Bandit
      • Compute distributions based on data
      • Sample p1 and p2 from these distributions
      • Put a coin in bandit 1 if p1 > p2
      • Else, put the coin in bandit 2
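The four steps above are commonly known as Thompson sampling. A minimal sketch for any number of arms, assuming success/failure rewards and a uniform Beta(1, 1) prior; the payout rates, the step count, the seed and the name `run_bandit` are made up for the example.

```python
import random

def select(counts, rng):
    """Sample a rate per arm from Beta(successes + 1, failures + 1); pick the largest."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in counts]
    return samples.index(max(samples))

def run_bandit(true_rates, steps, seed=0):
    """Play the bandit for a number of steps and return [successes, failures] per arm."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_rates]
    for _ in range(steps):
        arm = select(counts, rng)
        if rng.random() < true_rates[arm]:  # simulated payout, unknown to select()
            counts[arm][0] += 1
        else:
            counts[arm][1] += 1
    return counts

counts = run_bandit(true_rates=[0.05, 0.5], steps=2000)
pulls = [s + f for s, f in counts]
```

Early on the posteriors are wide, so both arms get tried. As evidence accumulates the samples for the weaker arm rarely come out on top, and traffic shifts to the better arm without a separate exploration phase.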
  27. And it works!
      [Plot: regret (0 to 0.12) against n (up to 1000) for two strategies: ε-greedy with ε = 0.05, and Bayesian Bandit with Gamma-Normal]
  28. Video Demo
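  29. The Code
      • Select an alternative
          # k is an n x 2 matrix of counts. Per the rbeta call, column 2 acts
          # as the success count and column 1 as the failure count.
          select = function(k) {
            n = dim(k)[1]
            p0 = rep(0, length.out=n)
            for (i in 1:n) {
              # sample a plausible rate for alternative i from its Beta posterior
              p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
            }
            return (which(p0 == max(p0)))
          }
      • Select and learn
          # The slide shows only the loop body; the wrapper name "learn" and its
          # arguments are assumed. test(i) is not shown on the slide; it has to
          # return the column (1 or 2) to increment for alternative i.
          learn = function(k, steps) {
            for (z in 1:steps) {
              i = select(k)
              j = test(i)
              k[i,j] = k[i,j]+1
            }
            return (k)
          }
      • But we already know how to count!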
  30. The Basic Idea
      • We can encode a distribution by sampling
      • Sampling allows unification of exploration and exploitation
      • Can be extended to more general response models
  31. Contact:
         - tdunning@maprtech.com
         - @ted_dunning
      Slides and such (available late tonight):
         - http://info.mapr.com/ted-pjug
      Hash tags: #mapr #pjug
  32. Thank You