- 1. Summingbird: streaming portable map-reduce. Oscar Boykin | Twitter | @posco | @summingbird
- 2. @Twitter What is summingbird? 1) Model for streaming multi-stage map-reduce
- 3. What is summingbird? 2) Implementations to run this model on Storm, Hadoop, Spark and soon
- 4. What is summingbird? 2) Implementations to run this model on Storm, Hadoop, Spark and soon. Highlight: Portable
- 5. What is summingbird? 3) Systematic implementation of the “Lambda Architecture”
- 6. What is summingbird? 3) Systematic implementation of the “Lambda Architecture”. Highlight: Fault Tolerant
- 7. What is streaming map-reduce? [Diagram of a topology with the nodes: Source, Source, Map, Map, Merge, SumByKey, Lookup Service]
- 8. What is streaming map-reduce? [Same topology diagram as slide 7] We can push single data objects from either of the sources all the way through the topology. So, conceptually, state can be updated incrementally.
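
The topology on slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration of the model, not Summingbird code: `SumByKey`, `make_branch`, and the two example sources are hypothetical names, and the Lookup Service node is left out. Each call pushes one event from one source through its mapper into the shared per-key state.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

KV = Tuple[str, int]


class SumByKey:
    """Keeps one running total per key and updates it one event at a time."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = defaultdict(int)

    def push(self, key: str, value: int) -> None:
        self.state[key] += value


def make_branch(mapper: Callable[[str], Iterable[KV]], sink: SumByKey) -> Callable[[str], None]:
    """Builds Source -> Map -> SumByKey for one source.

    Several branches sharing the same sink play the role of the Merge node.
    """

    def push(event: str) -> None:
        for key, value in mapper(event):
            sink.push(key, value)

    return push


# Two hypothetical sources feeding the same SumByKey node.
counts = SumByKey()
push_clicks = make_branch(lambda url: [(url, 1)], counts)
push_views = make_branch(lambda line: [(word, 1) for word in line.split()], counts)

# Events arrive one at a time, from either source, in any order.
push_clicks("a.com")
push_views("a.com b.com")
push_clicks("b.com")
```

After the three pushes the state already reflects every event seen so far, which is the incremental update the slide describes.
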
- 12. Why do I want this?
- 13. 1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch mode (e.g. Hadoop, Tez, Spark).
- 14. Again: Summingbird is a portability and abstraction layer. Summingbird allows you to write your job logic once and change the backend as needed. Go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
- 15. 2) We have optimizers at the summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
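
Of the optimizations named on slide 15, the map-side combiner is the easiest to illustrate. The sketch below is not Summingbird's optimizer; `map_side_combine` and `reduce_by_key` are hypothetical helpers that assume the reduce operator is associative. Each partition pre-sums its own pairs before the shuffle, so fewer records cross the network while the final totals stay the same.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, int]


def map_side_combine(pairs: List[Pair], plus: Callable[[int, int], int]) -> List[Pair]:
    """Pre-aggregates values per key inside a single partition."""
    combined: Dict[str, int] = {}
    for key, value in pairs:
        combined[key] = plus(combined[key], value) if key in combined else value
    return list(combined.items())


def reduce_by_key(
    partitions: List[List[Pair]],
    plus: Callable[[int, int], int],
    use_combiner: bool,
) -> Tuple[Dict[str, int], int]:
    """Returns the per-key totals and the number of records that were shuffled."""
    shuffled: List[Pair] = []
    for part in partitions:
        shuffled.extend(map_side_combine(part, plus) if use_combiner else part)
    # The reducer applies the same associative operator to whatever arrives.
    totals = dict(map_side_combine(shuffled, plus))
    return totals, len(shuffled)


partitions = [
    [("a", 1), ("a", 2), ("b", 1)],
    [("a", 3), ("b", 4), ("b", 5)],
]
plain_totals, plain_shuffled = reduce_by_key(partitions, lambda x, y: x + y, use_combiner=False)
fast_totals, fast_shuffled = reduce_by_key(partitions, lambda x, y: x + y, use_combiner=True)
```
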
- 16. 3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
- 17. What is the Lambda Architecture?
- 18. Lambda Architecture. @nathanmarz http://lambda-architecture.net
- 19. But how do you build a lambda architecture?
- 20. All Hail the Monoid (associative operator): 1 + 2 + 3 = 6
- 21. All Hail the Monoid (associative operator): 1 + (2 + 3) = 1 + 5 = 6
- 22. All Hail the Monoid (associative operator): (1 + 2) + 3 = 3 + 3 = 6
- 23. Example Monoids
  - (a min b) min c = a min (b min c)
  - (a max b) max c = a max (b max c)
  - (a or b) or c = a or (b or c)
  - addition: (a + b) + c = a + (b + c)
  - set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
  - set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
  - harmonic sum: 1/(1/a + 1/b)
  - approximate unique count (HLL), approximate counter (CMS)
  - and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
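
A few of the monoids from slide 23 can be written down directly. This is a minimal sketch under stated assumptions: the `Monoid` class, `is_associative`, and `vector_of` are hypothetical names chosen for illustration and do not mirror any particular library's API. Here a monoid is modeled as an associative `plus` together with an identity element `zero`.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    zero: T                      # identity element: plus(zero, x) == x
    plus: Callable[[T, T], T]    # associative binary operator

    def combine_all(self, items: Iterable[T]) -> T:
        """Folds the items with plus, starting from zero."""
        return reduce(self.plus, items, self.zero)


addition = Monoid(0, lambda a, b: a + b)
maximum = Monoid(float("-inf"), max)
union = Monoid(frozenset(), lambda a, b: a | b)


def is_associative(m: Monoid, a, b, c) -> bool:
    """Spot-checks (a + b) + c == a + (b + c) for one triple of values."""
    return m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))


def vector_of(m: Monoid, size: int) -> Monoid:
    """Lifts a monoid to fixed-length tuples by applying it element-wise."""
    return Monoid(
        tuple(m.zero for _ in range(size)),
        lambda xs, ys: tuple(m.plus(x, y) for x, y in zip(xs, ys)),
    )


vector_max = vector_of(maximum, 2)
```

Subtraction is a useful counterexample: `is_associative(Monoid(0, lambda a, b: a - b), 1, 2, 3)` is false, so partial results of a subtraction cannot be regrouped freely.
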
- 24. Batching and associativity yield reliability. [Diagram: Log, Hadoop, and RT (realtime) rows across Batch0, Batch1, Batch2, Batch3; the Hadoop row is labeled fault tolerant and the RT row noisy] Realtime sums from 0, each batch.
- 25. Batching and associativity yield reliability. [Same diagram as slide 24] Realtime sums from 0, each batch. Hadoop keeps a total sum (reliably).
- 26. Batching and associativity yield reliability. [Same diagram as slide 24] Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size. Done at query time.
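
The query-time merge on slide 26 can be sketched as follows. This is an illustration under simplifying assumptions, not how Summingbird or Storehaus implement it: events carry an explicit batch id, values are plain integers under addition, and `batch_layer`, `RealtimeLayer`, and `query` are hypothetical names. The batch layer holds a reliable running total through the last complete batch; the realtime layer holds a partial sum per batch that starts from 0.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[int, str, int]  # (batch_id, key, value)


def batch_layer(log: List[Event], last_complete_batch: int) -> Dict[str, int]:
    """Offline job: reliable total per key over all batches up to last_complete_batch."""
    totals: Dict[str, int] = defaultdict(int)
    for batch_id, key, value in log:
        if batch_id <= last_complete_batch:
            totals[key] += value
    return dict(totals)


class RealtimeLayer:
    """Online job: one partial sum per (batch, key), each starting from 0."""

    def __init__(self) -> None:
        self.partials: Dict[Tuple[int, str], int] = defaultdict(int)

    def push(self, event: Event) -> None:
        batch_id, key, value = event
        self.partials[(batch_id, key)] += value

    def get(self, batch_id: int, key: str) -> int:
        return self.partials.get((batch_id, key), 0)


def query(key: str, batch_totals: Dict[str, int], last_complete_batch: int,
          realtime: RealtimeLayer) -> int:
    """Hadoop total through batch i-1 plus the realtime partial for batch i."""
    return batch_totals.get(key, 0) + realtime.get(last_complete_batch + 1, key)


log: List[Event] = [(0, "a", 1), (0, "a", 2), (1, "a", 4), (2, "a", 8)]

realtime = RealtimeLayer()
for event in log:
    if event != (0, "a", 2):  # simulate the realtime layer dropping one old event
        realtime.push(event)

totals_through_batch_1 = batch_layer(log, last_complete_batch=1)
answer = query("a", totals_through_batch_1, 1, realtime)
```

The dropped event belongs to batch 0, which the batch layer has already covered, so it does not affect the answer. Only errors in the current batch's realtime partial can leak into a query, which is the sense in which the noise is bounded.
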
- 27. Lambda Architecture with Summingbird and Storehaus. [Diagram components: Summingbird-scalding, Summingbird-storm, storehaus-memcache, storehaus-algebra, storehaus-hbase, Kafka]
- 28. What has Twitter built with this?
  - realtime dashboards: ads, operations, publishers.
  - stream transformation: filtering, mapping, joining, then exporting.
  - building realtime features for ML models.
  - top-K applications: most viewed, most clicked, etc.
- 30. [Dataflow diagram with the labels: Tweets, (Flat)Mappers (f), Reducers (+), HDFS/Queue, HDFS/Queue] Emitted pairs: [(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))]. groupBy tweetid. reduce: (x, y) => sum CMS tables (x, y).
- 31. CMS notes:
  - The CMS is fixed size, so it never blows up.
  - delta = 1%, eps = 0.1% gives table size ~5000.
  - Can query any (tweetid, 0 == all) for counts.
  - Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
  - Using heavy-hitters, you can see top embedded tweets.
  - Add a time-bucket to the key for keeping history.
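
The job on slides 30 and 31 can be approximated with a small count-min sketch. This is a simplified illustration, not the CMS implementation used at Twitter: the `CountMinSketch` class, its SHA-256 based hashing, the `embed_counts` helper, and the `width` and `depth` values are all assumptions, and the sizes are not derived from the delta and eps figures on the slide. The point is that two sketches of the same shape add cell by cell, so the CMS is itself a fixed-size monoid.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Tuple


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width: int, depth: int, table: Optional[List[List[int]]] = None) -> None:
        self.width = width
        self.depth = depth
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    @classmethod
    def of(cls, item: str, width: int, depth: int, count: int = 1) -> "CountMinSketch":
        """Sketch holding a single item, like CMS(item -> 1) on slide 30."""
        cms = cls(width, depth)
        for row in range(depth):
            cms.table[row][cms._index(row, item)] += count
        return cms

    def plus(self, other: "CountMinSketch") -> "CountMinSketch":
        """Monoid operation: add two tables of the same shape cell by cell."""
        assert (self.width, self.depth) == (other.width, other.depth)
        table = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, table)

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))


def embed_counts(events: Iterable[Tuple[int, str]], width: int = 272,
                 depth: int = 5) -> Dict[int, CountMinSketch]:
    """For each (tweet_id, domain) event, emit two pairs and sum the sketches by key.

    Key tweet_id counts domains per tweet; key 0 counts tweets across all events.
    """
    state: Dict[int, CountMinSketch] = {}
    for tweet_id, domain in events:
        for key, item in ((tweet_id, domain), (0, str(tweet_id))):
            one = CountMinSketch.of(item, width, depth)
            state[key] = state[key].plus(one) if key in state else one
    return state


events = [(42, "a.com"), (42, "a.com"), (42, "b.com"), (7, "a.com")]
state = embed_counts(events)
```

Here `state[42].estimate("a.com")` approximates how often tweet 42 was embedded on a.com, and `state[0].estimate("42")` approximates its total embeds. Adding a time bucket to the key, as the last bullet suggests, would only change the keys emitted inside `embed_counts`.
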
- 32. Review: @Summingbird is:
  - 1) Portability/Optimization layer: write once, run on many platforms.
  - 2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed.
  - 3) Real-world & high throughput.
- 33. Resources
  - twitter: @summingbird
  - mail: summingbird@groups.google.com
  - irc: freenode/#summingbird
  - github.com/twitter/summingbird
- 34. Join us! Twitter is hiring people to use and develop @scalding and @summingbird to build realtime analytics and ML. twitter: @posco, email: oscar at twitter
- 35. Thank you!
