8. @Twitter
What is streaming map-reduce?
[Diagram: topology with two Sources, two Map stages, a Lookup Service, a Merge, and a SumByKey]
We can push single data objects from either of the sources all the way through the topology => conceptually, state can be updated incrementally.
13. @Twitter
1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch (e.g. Hadoop, Tez, Spark).
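A minimal sketch of this idea in plain Python, not Summingbird's API: the job logic (a hypothetical `flat_map` plus a key-wise sum) is written once, then driven either over a whole data set or one event at a time. The names `flat_map`, `run_batch`, and `StreamingRunner` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flat_map(event: str) -> List[Tuple[str, int]]:
    """Hypothetical job logic: emit (word, 1) for every word in an event."""
    return [(word, 1) for word in event.split()]


def run_batch(events: Iterable[str]) -> Dict[str, int]:
    """Offline style: map the whole data set, then sum by key."""
    state: Dict[str, int] = defaultdict(int)
    pairs = [pair for event in events for pair in flat_map(event)]
    for key, value in pairs:
        state[key] += value
    return dict(state)


class StreamingRunner:
    """Realtime style: push one event at a time and update state incrementally."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = defaultdict(int)

    def push(self, event: str) -> None:
        for key, value in flat_map(event):
            self.state[key] += value


events = ["a b a", "b c"]
runner = StreamingRunner()
for event in events:
    runner.push(event)

batch_result = run_batch(events)
stream_result = dict(runner.state)
```

Both runners reuse the same `flat_map` and the same sum, so the two results agree once the stream has seen every event.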
14. @Twitter
Again: Summingbird is a portability and abstraction layer
Summingbird allows you to write your job logic once, and change the backend as needed.
Go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
15. @Twitter
2) We have optimizers at the Summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
16. @Twitter
3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
21. @Twitter
All Hail the Monoid (associative operator)
1 + 2 + 3 = 6, grouped as 1 + (2 + 3) = 1 + 5 = 6
22. @Twitter
All Hail the Monoid (associative operator)
1 + 2 + 3 = 6, grouped as (1 + 2) + 3 = 3 + 3 = 6
23. @Twitter
Example Monoids
• (a min b) min c = a min (b min c)
• (a max b) max c = a max (b max c)
• (a or b) or c = a or (b or c)
• addition: (a + b) + c = a + (b + c)
• set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
• set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
• harmonic sum: a ⊕ b = 1/(1/a + 1/b)
• approximate unique count (HLL), approximate counter (CMS)
• and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
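A few of the examples above can be sketched in plain Python. This is an illustration under stated assumptions, not Algebird's API: the `Monoid` class, the constant names, and `is_associative` are hypothetical, and the identity ("zero") elements are chosen here for convenience.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An associative binary operator together with an identity element."""

    zero: T
    plus: Callable[[T, T], T]

    def sum(self, items: Iterable[T]) -> T:
        # Associativity means any grouping of the items gives the same answer.
        return reduce(self.plus, items, self.zero)


ADD = Monoid(0, lambda a, b: a + b)
MAX = Monoid(float("-inf"), max)
MIN = Monoid(float("inf"), min)
OR = Monoid(False, lambda a, b: a or b)
UNION = Monoid(frozenset(), lambda a, b: a | b)


def vector_max(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Element-wise max, as in [a1, a2] max [b1, b2]."""
    return tuple(max(x, y) for x, y in zip(a, b))


def is_associative(m: Monoid, a, b, c) -> bool:
    """Check (a + b) + c == a + (b + c) for one triple of values."""
    return m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))


total = ADD.sum([1, 2, 3])
largest = MAX.sum([3, 1, 2])
merged = UNION.sum([frozenset({1}), frozenset({2})])
```

Checking one triple with `is_associative` is only a spot check, not a proof, but it mirrors the identities listed on the slide.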
25. @Twitter
Batching and associativity yield reliability
[Diagram: Log split into Batch0 to Batch3; a Hadoop row (fault tolerant) and an RT row (noisy) each process every batch]
Realtime sums from 0, each batch.
Hadoop keeps a total sum (reliably).
26. @Twitter
Batching and associativity yield reliability
[Diagram: Log split into Batch0 to Batch3; a Hadoop row (fault tolerant) and an RT row (noisy) each process every batch]
Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise and bounded read/write size. Done at query time.
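The query-time merge can be sketched as follows, assuming per-key integer counts. The helper names (`plus`, `hadoop_total`, `query`) and the sample data are hypothetical; this is not the Summingbird or Storehaus client API.

```python
from typing import Dict, List

Counts = Dict[str, int]


def plus(a: Counts, b: Counts) -> Counts:
    """Key-wise sum of two count maps (an associative operation)."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


def hadoop_total(batches: List[Counts], through: int) -> Counts:
    """Reliable running total over batches 0..through, as the offline job keeps it."""
    total: Counts = {}
    for batch in batches[: through + 1]:
        total = plus(total, batch)
    return total


def query(key: str, hadoop_prev: Counts, rt_current: Counts) -> int:
    """At query time: Hadoop total through batch i-1 plus the RT sum for batch i."""
    return hadoop_prev.get(key, 0) + rt_current.get(key, 0)


completed_batches = [{"a": 2}, {"a": 3, "b": 1}]  # batches 0 and 1, processed offline
rt_batch2 = {"a": 1}  # realtime partial sum for batch 2, started from zero
offline = hadoop_total(completed_batches, through=1)
answer = query("a", offline, rt_batch2)
```

Only the current batch's realtime sum contributes noise, and each query reads just two values per key, which is where the bounded noise and bounded read size come from.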
27. @Twitter
Lambda Architecture with Summingbird and Storehaus
[Diagram components: Kafka, Summingbird-scalding, Summingbird-storm, storehaus-hbase, storehaus-memcache, storehaus-algebra]
28. @Twitter
What has Twitter built with this?
* realtime dashboards: ads, operations, publishers.
* stream transformation: filtering, mapping, joining, then exporting.
* building realtime features for ML models.
* top-K applications: most viewed, most clicked, etc.
30. [Diagram: Tweets, (Flat)Mappers (f), Reducers (+), and two HDFS/Queue stages]
(Flat)Mappers emit: [(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))]
groupBy tweetid
reduce: (x, y) => sum CMS tables (x, y)
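A minimal Count-Min Sketch in plain Python shows why the reduce step is just a table sum. This is a sketch under assumptions, not Algebird's CMS: the `CountMinSketch` class, its SHA-256 based row hashing, and the small depth and width are all chosen here for illustration.

```python
import hashlib
from typing import List, Optional


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, depth: int, width: int,
                 table: Optional[List[List[int]]] = None) -> None:
        self.depth = depth
        self.width = width
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Assumed hashing scheme: one SHA-256 digest per (row, item) pair.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        """The reduce step: sum two tables cell by cell."""
        if (self.depth, self.width) != (other.depth, other.width):
            raise ValueError("sketches must have the same dimensions")
        table = [[a + b for a, b in zip(row_a, row_b)]
                 for row_a, row_b in zip(self.table, other.table)]
        return CountMinSketch(self.depth, self.width, table)


x = CountMinSketch(depth=4, width=64)
y = CountMinSketch(depth=4, width=64)
for _ in range(3):
    x.add("example.com")
y.add("example.com", 2)
y.add("other.org")
combined = x.merge(y)
```

Because merging is cell-wise addition, it is associative and the merged table has the same size as its inputs, so sketches from different mappers or batches can be combined in any grouping.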
31. @Twitter
• The CMS is fixed size, so it never blows up.
• delta = 1%, eps = 0.1% gives table size ~5000.
• Can query any tweetid (or key 0, meaning all tweets) for counts.
• Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
• Using heavy-hitters, you can see top embedded tweets.
• Add a time-bucket to the key for keeping history.
32. @Twitter
Review: @Summingbird is:
1) Portability/Optimization layer: write once, run on many platforms.
2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed.
3) Real-world & high throughput.
34. @Twitter
Join us!
Twitter is hiring people to use and develop @scalding and @summingbird to build realtime analytics and ML.
twitter: @posco
email: oscar at twitter