Summingbird: Streaming Portable, MapReduce

Presentation Transcript

  • Summingbird: streaming portable map-reduce Oscar Boykin | Twitter | @posco | @summingbird
  • @Twitter What is Summingbird?
    1) A model for streaming multi-stage map-reduce.
    2) Portable: implementations to run this model on Storm, Hadoop, Spark and soon
    3) Fault tolerant: a systematic implementation of the “Lambda Architecture”.
  • What is streaming map-reduce? [Diagram: a topology with the nodes Source, Source, Map, Map, Lookup Service, Merge, and SumByKey.] We can push single data objects from either of the sources all the way through the topology, so, conceptually, state can be updated incrementally.
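To make the "push one object all the way through" idea concrete, here is a minimal sketch in plain Python. It is not Summingbird code: the event shapes, the two mapper functions, and the `SumByKey` class are hypothetical names chosen for illustration, and the Lookup Service from the diagram is left out.

```python
from collections import defaultdict


class SumByKey:
    """Holds one running sum per key; each pushed pair updates state in place."""

    def __init__(self):
        self.state = defaultdict(int)

    def push(self, key, value):
        self.state[key] += value


def map_clicks(event):
    # Hypothetical mapper for the first source: one count per clicked URL.
    return [(event["url"], 1)]


def map_views(event):
    # Hypothetical mapper for the second source: a pre-counted number of views.
    return [(event["url"], event["views"])]


store = SumByKey()


def push_from(mapper, event):
    """Push a single event through map -> merge -> sumByKey."""
    for key, value in mapper(event):
        # "Merge" here simply means both sources feed the same store.
        store.push(key, value)


# Events arrive one at a time, from either source, in any order.
push_from(map_clicks, {"url": "a.example"})
push_from(map_views, {"url": "a.example", "views": 3})
push_from(map_clicks, {"url": "b.example"})
```

After every single `push_from` call the store already holds a valid, up-to-date sum, which is what makes the same topology usable in a realtime setting.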
  • Why do I want this?
  • 1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch (e.g. Hadoop, Tez, Spark).
  • Again: Summingbird is a portability and abstraction layer. Summingbird allows you to write your job logic once and change the backend as needed: go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
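The "write once, change the backend" claim can be illustrated with a small sketch. This is plain Python rather than the Summingbird API: `run_batch` and `StreamingRunner` are hypothetical stand-ins for a batch platform and a realtime platform, and word counting is just an assumed example job.

```python
from collections import Counter


def word_count_logic(line):
    """Job logic, written once: flat-map a line to (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def run_batch(logic, records):
    """Hypothetical batch backend: consumes the whole data set in one run."""
    totals = Counter()
    for record in records:
        for key, value in logic(record):
            totals[key] += value
    return totals


class StreamingRunner:
    """Hypothetical realtime backend: consumes one record per call."""

    def __init__(self, logic):
        self.logic = logic
        self.totals = Counter()

    def on_event(self, record):
        for key, value in self.logic(record):
            self.totals[key] += value


lines = ["to be or not to be", "be quick"]

batch_totals = run_batch(word_count_logic, lines)

runner = StreamingRunner(word_count_logic)
for line in lines:
    runner.on_event(line)
```

Both runners receive the same `word_count_logic` function and end up with the same totals; only the execution strategy differs.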
  • 2) We have optimizers at the Summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
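As a rough illustration of one optimization named above, the sketch below shows a map-side combiner in plain Python. The data and function names are hypothetical, and this is not how Summingbird implements it; it only shows why pre-summing on the mapper side reduces the data sent to the reducers without changing the result.

```python
from collections import Counter


def combine(pairs):
    """Sum values per key; used both as the combiner and as the reducer."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return partial


# Hypothetical output of two mappers.
mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("c", 1)],
]

# Without a combiner, every (key, value) pair is shuffled to the reducers.
shuffled_without = [pair for out in mapper_outputs for pair in out]

# With a combiner, each mapper ships at most one partial sum per key.
partials = [combine(out) for out in mapper_outputs]
shuffled_with = [item for partial in partials for item in partial.items()]

total_without = combine(shuffled_without)
total_with = combine(shuffled_with)
```

The totals agree because addition is associative, so it does not matter whether values are summed on the mapper or on the reducer.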
  • 3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
  • What is the Lambda Architecture?
  • Lambda Architecture: @nathanmarz, http://lambda-architecture.net
  • But how do you build a lambda architecture?
  • All Hail the Monoid (associative operator): 1 + 2 + 3 = 6, whether grouped as 1 + (2 + 3) = 1 + 5 or as (1 + 2) + 3 = 3 + 3.
  • Example Monoids
    • (a min b) min c = a min (b min c)
    • (a max b) max c = a max (b max c)
    • (a or b) or c = a or (b or c)
    • addition: (a + b) + c = a + (b + c)
    • set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
    • set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
    • harmonic sum: 1/(1/a + 1/b)
    • approximate unique count (HLL), approximate counter (CMS)
    • and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
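The associativity of the exact examples above can be spot-checked with a short Python sketch. The sample values are arbitrary, the helper names are made up for illustration, and `Fraction` is used for the harmonic sum only to avoid floating-point rounding in the comparison. The approximate structures (HLL, CMS) are not covered here.

```python
from fractions import Fraction


def harmonic_sum(a, b):
    """1 / (1/a + 1/b), as listed on the slide."""
    return 1 / (1 / a + 1 / b)


def vector_max(a, b):
    """Element-wise max of two equal-length vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def is_associative(op, a, b, c):
    """Check (a op b) op c == a op (b op c) for one sample triple."""
    return op(op(a, b), c) == op(a, op(b, c))


# Each entry: name -> (operator, one sample triple of values).
EXAMPLES = {
    "min": (min, (4, 1, 7)),
    "max": (max, (4, 1, 7)),
    "or": (lambda a, b: a or b, (False, True, False)),
    "addition": (lambda a, b: a + b, (1, 2, 3)),
    "set union": (lambda a, b: a | b, ({1}, {2}, {1, 3})),
    "set intersection": (lambda a, b: a & b, ({1, 2}, {2, 3}, {2, 4})),
    "harmonic sum": (harmonic_sum, (Fraction(1), Fraction(2), Fraction(3))),
    "vector max": (vector_max, ([1, 5], [2, 3], [0, 9])),
}

results = {
    name: is_associative(op, *values) for name, (op, values) in EXAMPLES.items()
}

# The grouping example for addition: both groupings of 1 + 2 + 3 agree.
left_first = (1 + 2) + 3
right_first = 1 + (2 + 3)
```

A single triple per operator is only a sanity check, not a proof, but it shows the property a reduce operator needs so that partial results can be combined in any grouping.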
  • Batching and associativity yield reliability. [Diagram: the log is split into Batch0 through Batch3; each batch is processed both by Hadoop (fault tolerant) and by the realtime layer, RT (noisy).]
    • Realtime sums from 0, each batch.
    • Hadoop keeps a total sum (reliably).
    • Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise and bounded read/write size. Done at query time.
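The query-time merge of the two layers can be sketched as follows. This is a toy model in plain Python under stated assumptions: batches are fixed ten-unit time windows, events are `(timestamp, value)` pairs, and the batch layer has finished batches 0 and 1 while batch 2 is still in flight. None of these names or numbers come from the slides.

```python
from collections import defaultdict

BATCH_SIZE = 10  # assumed: units of event time per batch


def batch_of(timestamp):
    """Map an event timestamp to its batch index."""
    return timestamp // BATCH_SIZE


# Hypothetical (timestamp, value) events spanning batches 0, 1 and 2.
events = [(1, 5), (4, 2), (12, 7), (18, 1), (23, 4)]

# Realtime layer: starts from 0 in each batch and sums only that batch.
rt_sums = defaultdict(int)
for ts, value in events:
    rt_sums[batch_of(ts)] += value

# Batch layer: keeps a reliable running total through each completed batch.
completed_batches = (0, 1)
hadoop_totals = {}
running_total = 0
for batch in completed_batches:
    running_total += sum(v for ts, v in events if batch_of(ts) == batch)
    hadoop_totals[batch] = running_total


def query(current_batch):
    """Merge at query time: Hadoop total through batch i-1 plus RT sum for batch i."""
    return hadoop_totals.get(current_batch - 1, 0) + rt_sums.get(current_batch, 0)
```

Because the realtime contribution is limited to a single batch, any error it introduces is bounded and is replaced once the batch layer catches up. The merge itself is just one more application of the associative sum.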
  • Lambda Architecture with Summingbird and Storehaus. [Diagram components: Kafka, Summingbird-scalding, Summingbird-storm, storehaus-algebra, storehaus-memcache, storehaus-hbase.]
  • What has Twitter built with this?
    • realtime dashboards: ads, operations, publishers.
    • stream transformation: filtering, mapping, joining, then exporting.
    • building realtime features for ML models.
    • top-K applications: most viewed, most clicked, etc.
  • [Diagram labels: Tweets, (Flat)Mappers (f), Reducers (+), HDFS/Queue.] Each tweet is flat-mapped to [(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))]; groupBy tweetid; reduce: (x, y) => sum CMS tables (x, y).
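The key layout on this slide can be sketched with exact counters standing in for the CMS. In the Python below, `collections.Counter` replaces the sketch so the structure is easy to follow; the input shape (a tweet id paired with the domain that embedded it) and all names are assumptions made for illustration, and key 0 is assumed never to be a real tweet id.

```python
from collections import Counter, defaultdict

ALL = 0  # the slide's reserved key: counts across all tweets


def flat_map(embed):
    """Emit one table keyed by tweet id and one keyed by ALL, as on the slide."""
    tweet_id, domain = embed
    return [
        (tweet_id, Counter({domain: 1})),  # which domains embed this tweet
        (ALL, Counter({tweet_id: 1})),     # how often each tweet is embedded
    ]


def sum_by_key(pairs):
    """Group by key and sum the tables (Counter stands in for the CMS)."""
    store = defaultdict(Counter)
    for key, table in pairs:
        store[key].update(table)  # adds counts element-wise
    return store


# Hypothetical embed events: (tweet id, embedding domain).
embeds = [
    (101, "news.example"),
    (101, "blog.example"),
    (202, "news.example"),
    (101, "news.example"),
]

store = sum_by_key(pair for embed in embeds for pair in flat_map(embed))
top_tweet, top_count = store[ALL].most_common(1)[0]
```

Looking up `store[101]` answers "which domains embedded tweet 101", while `store[ALL]` answers "which tweets are embedded most". Swapping `Counter` for a fixed-size sketch keeps the same shape but bounds the memory per key.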
  • Notes on the CMS:
    • The CMS is fixed size, so it never blows up.
    • delta = 1%, eps = 0.1% gives table size ~5000.
    • Can query any (tweetid, 0 == all) for counts.
    • Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
    • Using heavy-hitters, you can see top embedded tweets.
    • Add a time-bucket to the key for keeping history.
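A Count-Min Sketch with a monoid-style merge can be sketched in a few lines of Python. This is an illustrative toy, not the implementation used at Twitter: the class and method names are made up, SHA-256 is used only as a convenient deterministic hash, and the 1000 x 5 layout is an assumption that happens to give 5000 counters (the slides do not say how delta and eps map to table dimensions).

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width=1000, depth=5, table=None):
        self.width = width
        self.depth = depth
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One deterministic hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def merge(self, other):
        """Monoid plus: cell-wise sum of two tables with the same shape."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same dimensions")
        table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return CountMinSketch(self.width, self.depth, table)


left = CountMinSketch()
right = CountMinSketch()
for domain in ["a.example"] * 3 + ["b.example"]:
    left.add(domain)
for domain in ["a.example"] * 2:
    right.add(domain)

merged = left.merge(right)
counter_cells = merged.width * merged.depth
```

Merging is a cell-wise addition, so it is associative and the table size stays constant no matter how many items are added, which is why the sketch fits the reduce step described above.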
  • Review: @Summingbird is:
    1) Portability/optimization layer: write once, run on many platforms.
    2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed.
    3) Real-world & high throughput.
  • Resources
    • twitter: @summingbird
    • mail: summingbird@groups.google.com
    • irc: freenode/#summingbird
    • github.com/twitter/summingbird
  • Join us! Twitter is hiring people to use and develop @scalding and @summingbird to build realtime analytics and ML.
    • twitter: @posco
    • email: oscar at twitter
  • Thank you!