Summingbird:
streaming portable map-reduce
Oscar Boykin | Twitter | @posco | @summingbird


Published in: Technology


  1. Summingbird: streaming portable map-reduce. Oscar Boykin | Twitter | @posco | @summingbird
  2. What is summingbird? 1) Model for streaming multi-stage map-reduce
  3. What is summingbird? 2) Implementations to run this model on Storm, Hadoop, Spark and soon
  4. What is summingbird? 2) Implementations to run this model on Storm, Hadoop, Spark and soon => Portable
  5. What is summingbird? 3) Systematic implementation of the “Lambda Architecture”
  6. What is summingbird? 3) Systematic implementation of the “Lambda Architecture” => Fault Tolerant
  7. What is streaming map-reduce? [Diagram nodes: Source, Source, Map, Map, Merge, SumByKey, Lookup Service]
  8. What is streaming map-reduce? [Diagram nodes: Source, Source, Map, Map, Merge, SumByKey, Lookup Service] We can push single data objects from either of the sources, all the way through the topology => Conceptually, state can be updated incrementally.
  9. (no text on slide)
  10. (no text on slide)
  11. (no text on slide)
  12. Why do I want this?
  13. 1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch (e.g. Hadoop, Tez, Spark).
  14. Again: Summingbird is a portability and abstraction layer. Summingbird allows you to write your job logic once, and change the backend as needed. Go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
  15. 2) We have optimizers at the summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
  16. 3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
  17. What is the Lambda Architecture?
  18. Lambda Architecture. @nathanmarz http://lambda-architecture.net
  19. But how do you build a lambda architecture?
  20. All Hail the Monoid (associative operator): 1 + 2 + 3 = 6
  21. All Hail the Monoid (associative operator): 1 + (2 + 3) = 1 + 5 = 6
  22. All Hail the Monoid (associative operator): (1 + 2) + 3 = 3 + 3 = 6
  23. Example Monoids
      • (a min b) min c = a min (b min c)
      • (a max b) max c = a max (b max c)
      • (a or b) or c = a or (b or c)
      • addition: (a + b) + c = a + (b + c)
      • set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
      • set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
      • harmonic sum: 1/(1/a + 1/b)
      • approximate unique count (HLL), approximate counter (CMS)
      • and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
  24. Batching and associativity yield reliability. [Diagram: columns Batch0, Batch1, Batch2, Batch3; rows Log, Hadoop (fault tolerant), RT (noisy)] Realtime sums from 0, each batch.
  25. Batching and associativity yield reliability. [Same diagram] Realtime sums from 0, each batch. Hadoop keeps a total sum (reliably).
  26. Batching and associativity yield reliability. [Same diagram] Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise and bounded read/write size. Done at query time.
  27. Lambda Architecture with Summingbird and Storehaus: Summingbird-scalding, Summingbird-storm, storehaus-memcache, storehaus-algebra, storehaus-hbase, Kafka
  28. What has Twitter built with this?
      • realtime dashboards: ads, operations, publishers.
      • stream transformation: filtering, mapping, joining, then exporting.
      • building realtime features for ML models.
      • top-K applications: most viewed, most clicked, etc.
  29. (no text on slide)
  30. [Diagram labels: Tweets, (Flat)Mappers (f), Reducers (+), HDFS/Queue, HDFS/Queue] [(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))] reduce: (x,y) => sum CMS tables (x,y); groupBy tweetid
  31. Notes on the CMS:
      • The CMS is fixed size, so it never blows up.
      • delta = 1%, eps = 0.1% gives table size ~5000.
      • Can query any (tweetid, 0 == all) for counts.
      • Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
      • Using heavy-hitters, you can see top embedded tweets.
      • Add a time-bucket to the key for keeping history.
  32. Review: @Summingbird is:
      1) Portability/Optimization layer: write once, run on many platforms
      2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed.
      3) Real-world & high throughput.
  33. Resources
      twitter: @summingbird
      mail: summingbird@groups.google.com
      irc: freenode/#summingbird
      github.com/twitter/summingbird
  34. Join us! Twitter is hiring people to use and develop @scalding and @summingbird to build realtime analytics and ML. twitter: @posco email: oscar at twitter
  35. Thank you!
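
The batching argument on slides 20 to 26 can be sketched in a few lines. This is a minimal illustration in Python, not Summingbird's API: `Monoid`, `batch_totals`, and `query` are hypothetical names, and the event values are made up. It assumes the batch layer stores a running total through each completed batch and the realtime layer stores a partial sum that restarts from zero in every batch.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Dict, Generic, Iterable, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An identity element plus an associative binary operator."""
    zero: T
    plus: Callable[[T, T], T]

    def sum(self, items: Iterable[T]) -> T:
        # Associativity means any grouping of the items gives the same result.
        return reduce(self.plus, items, self.zero)


int_add = Monoid(0, lambda a, b: a + b)
int_max = Monoid(float("-inf"), max)


def batch_totals(monoid: Monoid, batches: List[List]) -> Dict[int, object]:
    """Running total through each batch, as the reliable batch layer would keep it."""
    totals, running = {}, monoid.zero
    for i, events in enumerate(batches):
        running = monoid.plus(running, monoid.sum(events))
        totals[i] = running
    return totals


def query(monoid: Monoid, totals: Dict, realtime: Dict, batch_id: int):
    """Query-time merge: batch total through (i - 1) plus realtime partial for i."""
    offline = totals.get(batch_id - 1, monoid.zero)
    online = realtime.get(batch_id, monoid.zero)
    return monoid.plus(offline, online)


# Batches 0 and 1 are complete and summed offline; batch 2 is still in flight,
# so only the (possibly noisy) realtime partial sum exists for it.
totals = batch_totals(int_add, [[1, 2, 3], [4, 5]])
realtime = {2: 6}
print(query(int_add, totals, realtime, 2))
```

Because the realtime contribution only ever covers one batch, any error it carries is replaced by the batch layer's total once that batch completes, which is one way to read the "bounded noise" remark on slide 26.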

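
The count-min sketch job on slides 30 and 31 relies on CMS tables forming a monoid under element-wise addition. The sketch below is a toy Python version under stated assumptions, not Algebird's or Summingbird's implementation: the `CMS` class, `flat_map`, `sum_by_key`, the SHA-256 based hashing, and the `DEPTH` and `WIDTH` values are all illustrative choices and do not correspond to the delta and eps figures on slide 31.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

DEPTH = 5    # number of hash rows (illustrative)
WIDTH = 272  # counters per row (illustrative)


def _bucket(row: int, key: str) -> int:
    """Deterministic per-row hash of a key into [0, WIDTH)."""
    digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH


@dataclass(frozen=True)
class CMS:
    table: Tuple[Tuple[int, ...], ...]

    @staticmethod
    def zero() -> "CMS":
        return CMS(tuple(tuple([0] * WIDTH) for _ in range(DEPTH)))

    @staticmethod
    def of(key: str, count: int = 1) -> "CMS":
        rows = []
        for r in range(DEPTH):
            row = [0] * WIDTH
            row[_bucket(r, key)] = count
            rows.append(tuple(row))
        return CMS(tuple(rows))

    def plus(self, other: "CMS") -> "CMS":
        # Element-wise addition: associative, with zero() as the identity.
        return CMS(tuple(
            tuple(a + b for a, b in zip(x, y))
            for x, y in zip(self.table, other.table)
        ))

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a counter, so the minimum never undercounts.
        return min(self.table[r][_bucket(r, key)] for r in range(DEPTH))


def flat_map(tweet_id: int, domain: str) -> List[Tuple[int, CMS]]:
    """Emit per-tweet domain counts, plus a global entry under key 0 (slide 30)."""
    return [(tweet_id, CMS.of(domain)), (0, CMS.of(str(tweet_id)))]


def sum_by_key(pairs: Iterable[Tuple[int, CMS]]) -> Dict[int, CMS]:
    out: Dict[int, CMS] = {}
    for key, sketch in pairs:
        out[key] = out.get(key, CMS.zero()).plus(sketch)
    return out


# Made-up embed events: (tweet id, embedding domain).
events = [(42, "a.com"), (42, "a.com"), (42, "b.com"), (7, "a.com")]
store = sum_by_key(p for tid, dom in events for p in flat_map(tid, dom))
print(store[42].estimate("a.com"), store[0].estimate("42"))
```

Each table has a fixed size regardless of how many keys are added, which is the property the first bullet on slide 31 points to.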