Summingbird:
streaming portable map-reduce
Oscar Boykin | Twitter | @posco | @summingbird
@Twitter
What is summingbird?
1) Model for streaming multi-stage map-reduce
2) Implementations to run this model on Storm, Hadoop, Spark and soon
Portable
3) Systematic implementation of the “Lambda Architecture”
Fault Tolerant
What is streaming map-reduce?
[Figure: topology with two Source nodes, two Map nodes, a Lookup Service, Merge, and SumByKey]
We can push single data objects from either of the sources, all the way through the topology => Conceptually, state can be updated incrementally.
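To make the incremental-update idea concrete, here is a minimal Python sketch of such a topology. It is not Summingbird code: the `LOOKUP` table, the event fields, and the choice of which Map stage consults the lookup service are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the Lookup Service in the figure.
LOOKUP = {"alice": "US", "bob": "UK"}


def map_clicks(event):
    """Map stage for the first source: resolve the key via the lookup service."""
    yield (LOOKUP.get(event["user"], "unknown"), 1)


def map_views(event):
    """Map stage for the second source: the event already carries its key."""
    yield (event["country"], 1)


class SumByKey:
    """Keeps one running total per key, updated one event at a time."""

    def __init__(self):
        self.state = defaultdict(int)

    def push(self, pairs):
        for key, value in pairs:
            self.state[key] += value


store = SumByKey()
# Merge: both mapped streams feed the same SumByKey node.
store.push(map_clicks({"user": "alice"}))
store.push(map_views({"country": "US"}))
store.push(map_clicks({"user": "bob"}))
```

Each `push` call carries a single object from a source to the stored state, so the totals are always current without re-reading earlier input.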
Why do I want this?
1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch (e.g. Hadoop, Tez, Spark).
Again: Summingbird is a portability and abstraction layer
Summingbird allows you to write your job logic once, and change the backend as needed. Go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
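The "write once, change the backend" idea can be sketched in Python as job logic written against a small platform interface. This is only an illustration of the pattern, not Summingbird's API: `word_count_job`, `BatchPlatform`, and `StreamingPlatform` are hypothetical names.

```python
from collections import defaultdict


def word_count_job(platform, source):
    """Job logic, written once against a tiny platform interface."""
    pairs = platform.flat_map(source, lambda line: [(w, 1) for w in line.split()])
    return platform.sum_by_key(pairs)


class BatchPlatform:
    """Materializes the whole input at once (stand-in for a batch backend)."""

    def flat_map(self, source, fn):
        return [pair for item in source for pair in fn(item)]

    def sum_by_key(self, pairs):
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return dict(totals)


class StreamingPlatform:
    """Handles one item at a time (stand-in for a realtime backend)."""

    def flat_map(self, source, fn):
        for item in source:
            yield from fn(item)

    def sum_by_key(self, pairs):
        totals = {}
        for key, value in pairs:
            totals[key] = totals.get(key, 0) + value
        return totals


lines = ["a b a", "b c"]
batch_result = word_count_job(BatchPlatform(), lines)
stream_result = word_count_job(StreamingPlatform(), iter(lines))
```

The job function never mentions how data is moved, which is what lets the same logic run on either backend and produce the same totals.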
2) We have optimizers at the summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
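One of the optimizations named in point 2, the map-side combiner, depends on the same restriction as point 3: the reduce operator must be associative. A rough Python sketch follows, with hypothetical function names and toy data; it shows pre-aggregating inside each mapper partition before anything is sent to the reducer.

```python
from collections import Counter


def map_side_combine(partition):
    """Pre-aggregate (key, value) pairs inside one mapper before the shuffle."""
    combined = Counter()
    for key, value in partition:
        combined[key] += value
    return list(combined.items())


def reduce_sum(pairs):
    """Final reducer: sum all values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("c", 1), ("c", 1)],
]
# Without a combiner every pair crosses the network.
raw = [pair for part in partitions for pair in part]
# With a combiner each partition sends at most one pair per key.
combined = [pair for part in partitions for pair in map_side_combine(part)]
```

Both paths give the same totals, but the combined path ships four pairs instead of six. Because addition is associative, the regrouping cannot change the result.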
What is the Lambda Architecture?
Lambda Architecture. @nathanmarz
http://lambda-architecture.net
But how do you build a lambda architecture?
All Hail the Monoid (associative operator)
1 + 2 + 3 = 6
1 + (2 + 3) = 1 + 5 = 6
(1 + 2) + 3 = 3 + 3 = 6
Example Monoids
• (a min b) min c = a min (b min c)
• (a max b) max c = a max (b max c)
• (a or b) or c = a or (b or c)
• addition: (a + b) + c = a + (b + c)
• set union: (a u b) u c = a u (b u c)
• set intersection: (a n b) n c = a n (b n c)
• harmonic sum: 1/(1/a + 1/b)
• approximate unique count (HLL), approximate counter (CMS)
• and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
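A few of the operators listed above can be checked for associativity directly. The Python sketch below uses hypothetical helper names and exact fractions for the harmonic sum so the comparison is not affected by floating-point rounding. It also shows the practical payoff: a list can be reduced in chunks and the partial results combined afterwards.

```python
from fractions import Fraction
from functools import reduce


def harmonic_sum(a, b):
    """1/(1/a + 1/b), as in the list above."""
    return 1 / (1 / a + 1 / b)


def vector_max(a, b):
    """Element-wise max of two equal-length vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def is_associative(op, a, b, c):
    """Check (a op b) op c == a op (b op c) for one triple of values."""
    return op(op(a, b), c) == op(a, op(b, c))


def reduce_in_chunks(op, items, chunk_size):
    """Reduce each chunk separately, then reduce the partial results."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


total = reduce_in_chunks(lambda x, y: x + y, [1, 2, 3, 4, 5], chunk_size=2)
```

Checking a single triple does not prove associativity in general; it only demonstrates the property on sample values.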
Batching and associativity yield reliability
[Figure: batches Batch0 through Batch3, each with a Log, a Hadoop job, and RT (realtime) workers; labels: fault tolerant, Noisy]
Realtime sums from 0, each batch
Hadoop keeps a total sum (reliably)
Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size. Done at query time
Lambda Architecture with Summingbird and Storehaus
Summingbird-scalding
Summingbird-storm
storehaus-memcache
storehaus-algebra
storehaus-hbase
Kafka
What has Twitter built with this?
* realtime dashboards: ads, operations, publishers.
* stream transformation: filtering, mapping, joining then exporting
* building realtime features for ML models.
* top-K applications: most viewed, most clicked, etc.
[Figure: Tweets pass through (Flat)Mappers (f), then Reducers (+); HDFS/Queue appears twice as a label]
[(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))]
groupBy tweetid
reduce: (x,y) => sum CMS tables (x,y)
• The CMS is fixed size, so it never blows up.
• delta = 1%, eps = 0.1% gives table size ~5000.
• Can query any (tweetid, 0 == all) for counts.
• Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
• Using heavy-hitters, you can see top embedded tweets.
• Add a time-bucket to the key for keeping history.
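For readers unfamiliar with the count-min sketch (CMS), here is a minimal Python version that shows why it works as a reduce operator: the table has a fixed size, and two tables merge by element-wise addition. This is an illustrative sketch, not the implementation used in the talk. The SHA-256 based hashing and the 1000 x 5 layout are assumptions, chosen so that the table has 5000 cells.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates never fall below the true count."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One column per row, derived from a hash salted with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))

    def merge(self, other):
        """Associative plus: element-wise sum of two equally sized tables."""
        assert (self.width, self.depth) == (other.width, other.depth)
        out = CountMinSketch(self.width, self.depth)
        out.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return out


first = CountMinSketch()
first.add("example.com", 2)
second = CountMinSketch()
second.add("example.com")
second.add("other.org")
merged = first.merge(second)
```

The merged sketch is the same size as its inputs regardless of how many distinct keys were added, and `merge` can be applied in any grouping, which is what allows it to serve as the sum in a streaming or batch reducer.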
Review: @Summingbird is:
1) Portability/Optimization layer: write once, run on many platforms
2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed.
3) Real-world & high throughput.
Resources
twitter: @summingbird
mail: summingbird@groups.google.com
irc: freenode/#summingbird
github.com/twitter/summingbird
Join us!
Twitter is hiring people to use and develop @scalding and @summingbird to build realtime analytics and ML.
twitter: @posco
email: oscar at twitter
Thank you!

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Summingbird: Streaming Portable MapReduce

  • 1. Summingbird: streaming portable map-reduce. Oscar Boykin | Twitter | @posco | @summingbird
  • 2. What is Summingbird? 1) Model for streaming multi-stage map-reduce
  • 3. What is Summingbird? 2) Implementations to run this model on Storm, Hadoop, Spark and soon
  • 4. What is Summingbird? 2) Implementations to run this model on Storm, Hadoop, Spark and soon => Portable
  • 5. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture”
  • 6. What is Summingbird? 3) Systematic implementation of the “Lambda Architecture” => Fault tolerant
  • 7. What is streaming map-reduce? [Diagram of a topology with the nodes: Source, Source, Map, Map, Lookup Service, Merge, SumByKey]
  • 8. What is streaming map-reduce? [Same topology diagram.] We can push single data objects from either of the sources all the way through the topology => conceptually, state can be updated incrementally.
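
The incremental update described on slide 8 can be sketched in a few lines of plain Python. This is not Summingbird code: the function and class names are hypothetical, the wiring (each source feeds its own map step, the two streams are merged, then summed by key) is an assumption about the diagram, and the Lookup Service node is left out.

```python
from collections import defaultdict

def tokenize(tweet):
    """Map step for the first source: emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in tweet.split()]

def expand_click(click):
    """Map step for the second source: a click is already a (word, count) pair."""
    word, count = click
    return [(word.lower(), count)]

class SumByKey:
    """Keeps one running total per key and updates it one event at a time."""

    def __init__(self):
        self.state = defaultdict(int)

    def push(self, pairs):
        for key, value in pairs:
            self.state[key] += value

store = SumByKey()

# Merge step: events from either source arrive in any interleaving and are
# pushed through one at a time; no event needs to wait for a full batch.
store.push(tokenize("storm and hadoop"))
store.push(expand_click(("Storm", 3)))
store.push(tokenize("hadoop hadoop"))

totals = dict(store.state)
print(totals)
```

Each `push` touches only the keys carried by that one event, which is what makes the state incrementally updatable.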
  • 12. Why do I want this?
  • 13. 1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch (e.g. Hadoop, Tez, Spark).
  • 14. Again: Summingbird is a portability and abstraction layer. Summingbird allows you to write your job logic once, and change the backend as needed. Go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
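
To make "write your job logic once, and change the backend as needed" concrete, here is a toy sketch in Python. The two runner functions are hypothetical stand-ins for a batch and a realtime backend; they are not the Summingbird API, and the job (counting words by length) is invented for illustration.

```python
from functools import reduce

def job(event):
    """Job logic, written once: map one event to (key, value) pairs."""
    return [(len(word), 1) for word in event.split()]

def run_batch(events):
    """Hypothetical batch backend: materialise every pair, then reduce per key."""
    pairs = [pair for event in events for pair in job(event)]
    result = {}
    for key in sorted({k for k, _ in pairs}):
        values = (v for k, v in pairs if k == key)
        result[key] = reduce(lambda a, b: a + b, values)
    return result

def run_streaming(events):
    """Hypothetical realtime backend: update state as each event arrives."""
    state = {}
    for event in events:
        for key, value in job(event):
            state[key] = state.get(key, 0) + value
    return state

events = ["write once", "run on many platforms"]
batch_result = run_batch(events)
stream_result = run_streaming(events)
print(batch_result == stream_result)
```

Because `job` only ever sees one event at a time, neither runner needs to know anything about the other; that one-at-a-time assumption is what slide 13 relies on.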
  • 15. 2) We have optimizers at the Summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
  • 16. 3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
  • 17. What is the Lambda Architecture?
  • 19. But how do you build a lambda architecture?
  • 20. All Hail the Monoid (associative operator): 1 + 2 + 3 = 6
  • 21. All Hail the Monoid (associative operator): 1 + (2 + 3) = 1 + 5 = 6
  • 22. All Hail the Monoid (associative operator): (1 + 2) + 3 = 3 + 3 = 6
  • 23. Example Monoids
      • (a min b) min c = a min (b min c)
      • (a max b) max c = a max (b max c)
      • (a or b) or c = a or (b or c)
      • addition: (a + b) + c = a + (b + c)
      • set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
      • set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
      • harmonic sum: 1/(1/a + 1/b)
      • approximate unique count (HLL), approximate counter (CMS)
      • and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
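
A few of the monoids from slide 23 can be written down directly. The sketch below is plain Python with a hypothetical `Monoid` class (it is not Algebird or Summingbird); besides the associative operator emphasised on the slides, it also carries the identity element that a monoid has by definition.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Monoid:
    zero: Any                        # identity element: plus(zero, x) == x
    plus: Callable[[Any, Any], Any]  # associative binary operator

    def sum_all(self, items):
        """Fold any number of items with the operator, starting from zero."""
        return reduce(self.plus, items, self.zero)

add = Monoid(0, lambda a, b: a + b)
minimum = Monoid(float("inf"), min)
maximum = Monoid(float("-inf"), max)
logical_or = Monoid(False, lambda a, b: a or b)
union = Monoid(frozenset(), lambda a, b: a | b)
# Element-wise max over fixed-length pairs, as in the vector example.
vector_max = Monoid(
    (float("-inf"), float("-inf")),
    lambda a, b: tuple(max(x, y) for x, y in zip(a, b)),
)

def is_associative(m, a, b, c):
    """Check (a + b) + c == a + (b + c) for one triple of values."""
    return m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

print(add.sum_all([1, 2, 3]))
print(is_associative(add, 1, 2, 3))
```

Associativity is what lets a backend regroup the work freely: partial sums computed on different machines, or in different batches, combine to the same answer.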
  • 24. Batching and associativity yield reliability. [Diagram: the log is split into Batch0, Batch1, Batch2, Batch3; each batch is processed by Hadoop (fault tolerant) and by realtime, RT (noisy).] Realtime sums from 0, each batch.
  • 25. Batching and associativity yield reliability. [Same diagram.] Realtime sums from 0, each batch. Hadoop keeps a total sum (reliably).
  • 26. Batching and associativity yield reliability. [Same diagram.] Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size. Done at query time.
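
The query-time merge on slide 26 can be sketched as follows. Everything here is a hypothetical illustration in plain Python (not Summingbird or Storehaus code): the batch side holds a reliable running total through each completed batch, the realtime side holds a per-batch partial sum that restarts from 0, and a query adds the two.

```python
def batch_total(log_batches, upto):
    """Reliable running total over batches 0..upto inclusive (the Hadoop side)."""
    return sum(sum(batch) for batch in log_batches[: upto + 1])

class RealtimeStore:
    """Per-batch partial sums; each batch starts from 0, so noise stays local."""

    def __init__(self):
        self.partial = {}

    def add(self, batch_id, value):
        self.partial[batch_id] = self.partial.get(batch_id, 0) + value

def query(batch_store, realtime, current_batch):
    """Hadoop total through batch i-1 plus the realtime sum for batch i."""
    reliable = batch_store.get(current_batch - 1, 0)
    recent = realtime.partial.get(current_batch, 0)
    return reliable + recent

log_batches = [[1, 2], [3], [4, 5]]

# The batch layer has finished batches 0 and 1.
batch_store = {i: batch_total(log_batches, i) for i in range(2)}

realtime = RealtimeStore()
realtime.add(1, 0)   # assumed failure: realtime lost the event in batch 1
realtime.add(2, 4)   # batch 2 is still in flight and only realtime has seen it
realtime.add(2, 5)

answer = query(batch_store, realtime, current_batch=2)
print(answer)
```

The lost realtime event in batch 1 does not affect the answer, because the batch total for that period replaces the realtime value; only the still-open batch depends on the noisy side.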
  • 27. Lambda Architecture with Summingbird and Storehaus. [Diagram components: Kafka, Summingbird-scalding, Summingbird-storm, storehaus-memcache, storehaus-algebra, storehaus-hbase]
  • 28. What has Twitter built with this?
      • realtime dashboards: ads, operations, publishers.
      • stream transformation: filtering, mapping, joining, then exporting.
      • building realtime features for ML models.
      • top-K applications: most viewed, most clicked, etc.
  • 30. [Diagram: Tweets flow through (Flat)Mappers (f) to Reducers (+), connected by HDFS/Queue.] Key-value pairs: [(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))]; groupBy tweetid; reduce: (x, y) => sum CMS tables (x, y)
  • 31. Properties of the CMS approach:
      • The CMS is fixed size, so it never blows up.
      • delta = 1%, eps = 0.1% gives table size ~5000.
      • Can query any (tweetid, 0 == all) for counts.
      • Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
      • Using heavy-hitters, you can see top embedded tweets.
      • Add a time-bucket to the key for keeping history.
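
The reduce step on slide 30, summing CMS tables, works because a Count-Min Sketch forms a monoid under cell-wise addition. Below is a minimal Python sketch of that idea. It is not the Algebird implementation: the class, the SHA-256 based hashing, and the small `width` and `depth` defaults are assumptions chosen for illustration and are not derived from the delta and eps values on slide 31.

```python
import hashlib

class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width=272, depth=5, table=None):
        self.width = width
        self.depth = depth
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count
        return self

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def __add__(self, other):
        # Monoid "plus": add the two tables cell by cell.
        assert (self.width, self.depth) == (other.width, other.depth)
        merged = [[a + b for a, b in zip(r1, r2)]
                  for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, merged)

def to_pairs(tweet_id, domain):
    """Mapper output in the shape shown on slide 30; key 0 means 'all tweets'."""
    return [(tweet_id, CountMinSketch().add(domain)),
            (0, CountMinSketch().add(tweet_id))]

events = [(101, "a.com"), (101, "a.com"), (101, "b.org"), (202, "a.com")]
state = {}
for tweet_id, domain in events:
    for key, sketch in to_pairs(tweet_id, domain):
        state[key] = state[key] + sketch if key in state else sketch

print(state[101].estimate("a.com"), state[0].estimate(101))
```

The table size depends only on `width` and `depth`, never on how many distinct domains or tweets are seen, which is the "never blows up" property from slide 31.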
  • 32. Review: @Summingbird is: 1) Portability/optimization layer: write once, run on many platforms. 2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed. 3) Real-world and high throughput.
  • 33. Resources. twitter: @summingbird; mail: summingbird@groups.google.com; irc: freenode/#summingbird; github.com/twitter/summingbird
  • 34. Join us! Twitter is hiring people to use and develop @scalding and @summingbird to build realtime analytics and ML. twitter: @posco; email: oscar at twitter