SlideShare a Scribd company logo
1 of 50
Download to read offline
Apache Gearpump
next-gen streaming engine
Manu Zhang, mauzhang@apache.org
Sean Zhong, sean-zhong@apache.org
What is Gearpump ?
● An Akka[2]
based real-time streaming engine
● An Apache Incubator[1]
project since Mar.8th, 2016
● A super simple pump that consists of only two gears
but very powerful at streaming water
Gearpump
Architecture
● Actor concurrency
● Message passing
communication
● error handling and
isolation with
supervision hierarchy
● Master HA with Akka
Cluster
Why Gearpump ?
Stream processing is hard but no system
meets all requirements
● Infinite data
● Out-of-order data
● Low latency assurance (e.g real-time recommendation)
● Correctness requirement (e.g. charge advertisers for ads)
● Cheap to update user logics
How Gearpump solves the hard parts
● User interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG
How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly Once
● Dynamic DAG
User Interface - DAG API
Graph(
KafkaSource ~ ShufflePartitioner ~> WindowCounter,
WindowCounter ~ HashPartitioner ~> ModelCalculator,
WindowCounter ~ HashPartitioner ~> AnomalyDetector,
ModelCalculator ~> HashPartitioner ~> AnomalyDetector,
AnomalyDetector ~> ShufflePartitioner ~> KafkaSink)
partitioner
User Interface - Partitioner
Graph(
KafkaSource ~ ShufflePartitioner ~> WindowCounter,
WindowCounter ~ HashPartitioner ~> ModelCalculator,
WindowCounter ~ HashPartitioner ~> AnomalyDetector,
ModelCalculator ~ HashPartitioner ~> AnomalyDetector,
AnomalyDetector ~ ShufflePartitioner ~> KafkaSink)
WindowCounter
(parallelism = 3)
ModelCalculator
(parallelism = 2)
Task
Task
Task
Task
Task
User Interface - Compose DAG
How Gearpump solves the hard parts
● User interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG
Flow Control
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
Without Flow Control - messages queue up
KafkaSource WindowCounter ModelCalculator
fast fast very slow
Without Flow Control - OOM
KafkaSource WindowCounter ModelCalculator
fast fast very slow
With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
fast slow down very slow
backpressure
With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
Slow down Slow down Very Slow
backpressurebackpressure
pull slower
How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly Once
● Dynamic DAG
Out-of-order data
Event time
321
Processingtime
● Event time - when data generated
● Processing time - when data processed
1
2
3
4
5
6
Window Count
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
When can window counts be emitted ?
window messages count
[0, 2) 1
[2, 4)
2
[4, 6)
2
(“gearpump”, 1)
WindowCounter In-memory Table
● No window can be emitted since
message as early as time 1 has
not arrived
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
On Watermark[4]
Event time
321
Processingtime
1
2
3
4
5
6 watermark
● No timestamp earlier than
watermark will be seen
Out-of-order processing with watermark
window messages count
[0, 2) 1
[2, 4)
2
[4, 6)
2
(“gearpump”, 1)
WindowCounter In-memory Table
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
watermark = 0,
No window can be emitted
Out-of-order processing with watermark
window messages count
[0, 2) 1
[2, 4)
2
[4, 6)
2
(“gearpum
p”, [0, 2), 1)
WindowCounter In-memory Table
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
watermark = 2,
Window [0, 2) can be emitted
(“gearpump”, [0, 2), 1)
How to get watermark ?
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
From upstream
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
watermark
More on Watermark
● Source watermark defined by user
● Usually heuristic based
● Users decide whether to drop data arriving after
watermark
How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG
Exactly Once - no Lost or Duplicated States
● Lost states can cause
anomaly words skipped
● Duplicated states can cause
false alarm
States
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
kafka_offset window_counts
models
Exactly Once with asynchronous checkpointing
(2, kafka_offset)
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 0 Watermark = 0
(2, kafka_offset)
(2, kafka_offset)
(2, window_counts)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 0
(2, window_counts)
(2, kafka_offset)
(2, window_counts)
(2, models)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
(2, kafka_offset)
(2, window_counts)
(2, models)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
(2, kafka_offset)
(2, window_counts)
(2, models)
Crash
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
(2, kafka_offset)
(2, window_counts)
(2, models)
Recover to latest checkpoint at 2
KafkaSource WindowCounter ModelCalculator
modelswindow_countskafka_offset
Replay from kafka
Get state at 2
More on Exactly Once
● User extends PersistentState interface and system
checkpoints state for users automatically
● Watermark is persisted on master
How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly Once
● Dynamic DAG
Can we update model algorithm in a cheap way ?
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
The model algorithm
doesn’t perform well
Dynamic DAG - modify application at runtime
Performance
Yahoo! Streaming Benchmark[5]
Yahoo! Streaming
Benchmark
● 4 node cluster
● Each node has
32-core, 64 GB
memory
● 10GB Ethernet
Advanced features
DAG Visualization
● Watermark
● Node size reflects throughput
● Edge width represents flow rate
DAG Visualization
● Data skew analysis
Apache Beam[6]
Gearpump Runner
Beam Model: Fn Runners
Apache
Flink
Apache
Spark
Beam Model: Pipeline Construction
Other
LanguagesBeam Java
Beam
Python
Execution Execution
Cloud
Dataflow
Execution
Apache
Gearpump
Experimental features
● Web UI Authorization / OAuth2 Authentication
● CGroup Resource Isolation
● Binary Storm compatibility
● Akka Streams integration
Summary
● Gearpump is good at streaming infinite out-of-order
data and guarantees correctness
● Gearpump helps users to easily program streaming
applications, get runtime information and update
dynamically
References
1. gearpump.apache.org
2. akka.io
3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-ti
me-dagprocessing-with-akka-at-scale
4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akid
au.pdf
5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streami
ng-computation-engines-at
6. Apache Beam [project overview]
Backup Slides
What is Akka
● Message driven
● Shared nothing
● Location transparent
● Supervision based
failure management
● High performance
Actor
ActorActor
Actor Actor
spawn

More Related Content

What's hot

Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...Flink Forward
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...Flink Forward
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacApache Apex
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...Flink Forward
 
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareActionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareApache Apex
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentDataWorks Summit/Hadoop Summit
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniMonal Daxini
 
QCon London 2016 - Patterns of reliable in-stream processing @ Scale
QCon London 2016 - Patterns of reliable in-stream processing @ ScaleQCon London 2016 - Patterns of reliable in-stream processing @ Scale
QCon London 2016 - Patterns of reliable in-stream processing @ ScaleAlexey Kharlamov
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
Journey into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka StreamsJourney into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka StreamsKevin Webber
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
 

What's hot (20)

Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ...
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ...
 
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareActionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 
QCon London 2016 - Patterns of reliable in-stream processing @ Scale
QCon London 2016 - Patterns of reliable in-stream processing @ ScaleQCon London 2016 - Patterns of reliable in-stream processing @ Scale
QCon London 2016 - Patterns of reliable in-stream processing @ Scale
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Akka streams
Akka streamsAkka streams
Akka streams
 
ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015 ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015
 
Journey into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka StreamsJourney into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka Streams
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 

Viewers also liked

File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
node-crate: node.js and big data
 node-crate: node.js and big data node-crate: node.js and big data
node-crate: node.js and big dataStefan Thies
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Impetus Technologies
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_MeetupPaolo Platter
 
iBeacon - I vantaggi nel mondo retail
iBeacon - I vantaggi nel mondo retailiBeacon - I vantaggi nel mondo retail
iBeacon - I vantaggi nel mondo retailEmanuele Cucci
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to SparkSky Yin
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 
Introduzione ai Big Data e alla scienza dei dati - Big Data
Introduzione ai Big Data e alla scienza dei dati - Big DataIntroduzione ai Big Data e alla scienza dei dati - Big Data
Introduzione ai Big Data e alla scienza dei dati - Big DataVincenzo Manzoni
 
Big Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureBig Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureOdinot Stanislas
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
R-IoT Retail IoT - Come le IoT influenzano il mercato del Retail
R-IoT Retail IoT - Come le IoT influenzano il mercato del RetailR-IoT Retail IoT - Come le IoT influenzano il mercato del Retail
R-IoT Retail IoT - Come le IoT influenzano il mercato del Retailriot-markets
 
週末輕鬆一下
週末輕鬆一下週末輕鬆一下
週末輕鬆一下honan4108
 
Evolución de los contenidos móviles
Evolución de los contenidos móvilesEvolución de los contenidos móviles
Evolución de los contenidos móvilesAlberto Vicentini
 
Kellton Tech Profile - Consumer and Enterprise Web Applications
Kellton Tech Profile - Consumer and Enterprise Web ApplicationsKellton Tech Profile - Consumer and Enterprise Web Applications
Kellton Tech Profile - Consumer and Enterprise Web ApplicationsKellton Tech Solutions Ltd
 
How world class companies optimize the digital experience
How world class companies optimize the digital experienceHow world class companies optimize the digital experience
How world class companies optimize the digital experienceQualtrics
 

Viewers also liked (18)

Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
node-crate: node.js and big data
 node-crate: node.js and big data node-crate: node.js and big data
node-crate: node.js and big data
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_Meetup
 
iBeacon - I vantaggi nel mondo retail
iBeacon - I vantaggi nel mondo retailiBeacon - I vantaggi nel mondo retail
iBeacon - I vantaggi nel mondo retail
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
Introduzione ai Big Data e alla scienza dei dati - Big Data
Introduzione ai Big Data e alla scienza dei dati - Big DataIntroduzione ai Big Data e alla scienza dei dati - Big Data
Introduzione ai Big Data e alla scienza dei dati - Big Data
 
Big Data and Implications on Platform Architecture
Big Data and Implications on Platform ArchitectureBig Data and Implications on Platform Architecture
Big Data and Implications on Platform Architecture
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
R-IoT Retail IoT - Come le IoT influenzano il mercato del Retail
R-IoT Retail IoT - Come le IoT influenzano il mercato del RetailR-IoT Retail IoT - Come le IoT influenzano il mercato del Retail
R-IoT Retail IoT - Come le IoT influenzano il mercato del Retail
 
週末輕鬆一下
週末輕鬆一下週末輕鬆一下
週末輕鬆一下
 
Evolución de los contenidos móviles
Evolución de los contenidos móvilesEvolución de los contenidos móviles
Evolución de los contenidos móviles
 
Kellton Tech Profile - Consumer and Enterprise Web Applications
Kellton Tech Profile - Consumer and Enterprise Web ApplicationsKellton Tech Profile - Consumer and Enterprise Web Applications
Kellton Tech Profile - Consumer and Enterprise Web Applications
 
PPM and Governance Workshop
PPM and Governance WorkshopPPM and Governance Workshop
PPM and Governance Workshop
 
How world class companies optimize the digital experience
How world class companies optimize the digital experienceHow world class companies optimize the digital experience
How world class companies optimize the digital experience
 

Similar to Apache Gearpump next-gen streaming engine

Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 
Flink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesFlink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesAljoscha Krettek
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexThomas Weise
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQXin Wang
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Stream data from Apache Kafka for processing with Apache Apex
Stream data from Apache Kafka for processing with Apache ApexStream data from Apache Kafka for processing with Apache Apex
Stream data from Apache Kafka for processing with Apache ApexApache Apex
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
 
Beam me up, Samza!
Beam me up, Samza!Beam me up, Samza!
Beam me up, Samza!Xinyu Liu
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Apache Apex
 
Essential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ ScaleEssential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ ScaleKartik Paramasivam
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
 
Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017Jacob Maes
 
Nextcon samza preso july - final
Nextcon samza preso   july - finalNextcon samza preso   july - final
Nextcon samza preso july - finalYi Pan
 

Similar to Apache Gearpump next-gen streaming engine (20)

Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Flink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesFlink 0.10 - Upcoming Features
Flink 0.10 - Upcoming Features
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache Apex
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Stream data from Apache Kafka for processing with Apache Apex
Stream data from Apache Kafka for processing with Apache ApexStream data from Apache Kafka for processing with Apache Apex
Stream data from Apache Kafka for processing with Apache Apex
 
About time
About timeAbout time
About time
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Samza la hug
Samza la hugSamza la hug
Samza la hug
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Beam me up, Samza!
Beam me up, Samza!Beam me up, Samza!
Beam me up, Samza!
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)
 
Essential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ ScaleEssential Ingredients of Realtime Stream Processing @ Scale
Essential Ingredients of Realtime Stream Processing @ Scale
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017Unified Stream Processing at Scale with Apache Samza - BDS2017
Unified Stream Processing at Scale with Apache Samza - BDS2017
 
Nextcon samza preso july - final
Nextcon samza preso   july - finalNextcon samza preso   july - final
Nextcon samza preso july - final
 

Recently uploaded

The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 

Recently uploaded (20)

The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 

Apache Gearpump next-gen streaming engine

  • 1. Apache Gearpump next-gen streaming engine Manu Zhang, mauzhang@apache.org Sean Zhong, sean-zhong@apache.org
  • 2. What is Gearpump ? ● An Akka[2] based real-time streaming engine ● An Apache Incubator[1] project since Mar.8th, 2016 ● A super simple pump that consists of only two gears but very powerful at streaming water
  • 3. Gearpump Architecture ● Actor concurrency ● Message passing communication ● error handling and isolation with supervision hierarchy ● Master HA with Akka Cluster
  • 5. Stream processing is hard but no system meets all requirements ● Infinite data ● Out-of-order data ● Low latency assurance (e.g real-time recommendation) ● Correctness requirement (e.g. charge advertisers for ads) ● Cheap to update user logics
  • 6. How Gearpump solves the hard parts ● User interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG
  • 7. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
  • 8. User Interface - DAG API Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~> HashPartitioner ~> AnomalyDetector, AnomalyDetector ~> ShufflePartitioner ~> KafkaSink)
  • 9. partitioner User Interface - Partitioner Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~ HashPartitioner ~> AnomalyDetector, AnomalyDetector ~ ShufflePartitioner ~> KafkaSink) WindowCounter (parallelism = 3) ModelCalculator (parallelism = 2) Task Task Task Task Task
  • 10. User Interface - Compose DAG
  • 11. How Gearpump solves the hard parts ● User interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG
  • 12. Flow Control 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink
  • 13. Without Flow Control - messages queue up KafkaSource WindowCounter ModelCalculator fast fast very slow
  • 14. Without Flow Control - OOM KafkaSource WindowCounter ModelCalculator fast fast very slow
  • 15. With Flow Control - Backpressure KafkaSource WindowCounter ModelCalculator fast slow down very slow backpressure
  • 16. With Flow Control - Backpressure KafkaSource WindowCounter ModelCalculator Slow down Slow down Very Slow backpressurebackpressure pull slower
  • 17. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
  • 18. Out-of-order data Event time 321 Processingtime ● Event time - when data generated ● Processing time - when data processed 1 2 3 4 5 6
  • 19. Window Count 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink
  • 20. When can window counts be emitted ? window messages count [0, 2) 1 [2, 4) 2 [4, 6) 2 (“gearpump”, 1) WindowCounter In-memory Table ● No window can be emitted since message as early as time 1 has not arrived (“gearpump”, 1) (“gearpump”, 3) (“gearpump”, 5) (“gearpump”, 4) (“gearpump”, 2) WindowCounter
  • 21. On Watermark[4] Event time 321 Processingtime 1 2 3 4 5 6 watermark ● No timestamp earlier than watermark will be seen
  • 22. Out-of-order processing with watermark window messages count [0, 2) 1 [2, 4) 2 [4, 6) 2 (“gearpump”, 1) WindowCounter In-memory Table (“gearpump”, 1) (“gearpump”, 3) (“gearpump”, 5) (“gearpump”, 4) (“gearpump”, 2) WindowCounter watermark = 0, No window can be emitted
  • 23. Out-of-order processing with watermark window messages count [0, 2) 1 [2, 4) 2 [4, 6) 2 (“gearpum p”, [0, 2), 1) WindowCounter In-memory Table (“gearpump”, 1) (“gearpump”, 3) (“gearpump”, 5) (“gearpump”, 4) (“gearpump”, 2) WindowCounter watermark = 2, Window [0, 2) can be emitted (“gearpump”, [0, 2), 1)
  • 24. How to get watermark ? 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink
  • 25. From upstream 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink watermark
  • 26. More on Watermark ● Source watermark defined by user ● Usually heuristic based ● Users decide whether to drop data arriving after watermark
  • 27. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG
  • 28. Exactly Once - no Lost or Duplicated States ● Lost states can cause anomaly words skipped ● Duplicated states can cause false alarm States 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink kafka_offset window_counts models
  • 29. Exactly Once with asynchronous checkpointing (2, kafka_offset) KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 0 Watermark = 0 (2, kafka_offset)
  • 30. (2, kafka_offset) (2, window_counts) Exactly Once with asynchronous checkpointing KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 0 (2, window_counts)
  • 31. (2, kafka_offset) (2, window_counts) (2, models) Exactly Once with asynchronous checkpointing KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 2 (2, models)
  • 32. (2, kafka_offset) (2, window_counts) (2, models) Exactly Once with asynchronous checkpointing KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 2 (2, models)
  • 33. (2, kafka_offset) (2, window_counts) (2, models) Crash KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 2 (2, models)
  • 34. (2, kafka_offset) (2, window_counts) (2, models) Recover to latest checkpoint at 2 KafkaSource WindowCounter ModelCalculator modelswindow_countskafka_offset Replay from kafka Get state at 2
  • 35. More on Exactly Once ● User extends PersistentState interface and system checkpoints state for users automatically ● Watermark is persisted on master
  • 36. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
  • 37. Can we update model algorithm in a cheap way ? 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink The model algorithm doesn’t perform well
  • 38. Dynamic DAG - modify application at runtime
  • 41. Yahoo! Streaming Benchmark ● 4 node cluster ● Each node has 32-core, 64 GB memory ● 10GB Ethernet
  • 43. DAG Visualization ● Watermark ● Node size reflects throughput ● Edge width represents flow rate
  • 44. DAG Visualization ● Data skew analysis
  • 45. Apache Beam[6] Gearpump Runner Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution Apache Gearpump
  • 46. Experimental features ● Web UI Authorization / OAuth2 Authentication ● CGroup Resource Isolation ● Binary Storm compatibility ● Akka Streams integration
  • 47. Summary ● Gearpump is good at streaming infinite out-of-order data and guarantees correctness ● Gearpump helps users to easily program streaming applications, get runtime information and update dynamically
  • 48. References 1. gearpump.apache.org 2. akka.io 3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-ti me-dagprocessing-with-akka-at-scale 4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akid au.pdf 5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streami ng-computation-engines-at 6. Apache Beam [project overview]
  • 50. What is Akka ● Message driven ● Shared nothing ● Location transparent ● Supervision based failure management ● High performance Actor ActorActor Actor Actor spawn