2. What is Gearpump ?
● An Akka[2]
based real-time streaming engine
● An Apache Incubator[1]
project since Mar.8th, 2016
● A super simple pump that consists of only two gears
but very powerful at streaming water
5. Stream processing is hard but no system
meets all requirements
● Infinite data
● Out-of-order data
● Low latency assurance (e.g real-time recommendation)
● Correctness requirement (e.g. charge advertisers for ads)
● Cheap to update user logics
6. How Gearpump solves the hard parts
● User interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG
7. How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly Once
● Dynamic DAG
11. How Gearpump solves the hard parts
● User interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG
12. Flow Control
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
13. Without Flow Control - messages queue up
KafkaSource WindowCounter ModelCalculator
fast fast very slow
14. Without Flow Control - OOM
KafkaSource WindowCounter ModelCalculator
fast fast very slow
15. With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
fast slow down very slow
backpressure
16. With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
Slow down Slow down Very Slow
backpressurebackpressure
pull slower
17. How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly Once
● Dynamic DAG
19. Window Count
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
20. When can window counts be emitted ?
window messages count
[0, 2) 1
[2, 4)
2
[4, 6)
2
(“gearpump”, 1)
WindowCounter In-memory Table
● No window can be emitted since
message as early as time 1 has
not arrived
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
24. How to get watermark ?
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
25. From upstream
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
watermark
26. More on Watermark
● Source watermark defined by user
● Usually heuristic based
● Users decide whether to drop data arriving after
watermark
27. How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG
28. Exactly Once - no Lost or Duplicated States
● Lost states can cause
anomaly words skipped
● Duplicated states can cause
false alarm
States
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
kafka_offset window_counts
models
34. (2, kafka_offset)
(2, window_counts)
(2, models)
Recover to latest checkpoint at 2
KafkaSource WindowCounter ModelCalculator
modelswindow_countskafka_offset
Replay from kafka
Get state at 2
35. More on Exactly Once
● User extends PersistentState interface and system
checkpoints state for users automatically
● Watermark is persisted on master
36. How Gearpump solves the hard parts
● User Interface
● Flow control
● Out-of-order processing
● Exactly Once
● Dynamic DAG
37. Can we update model algorithm in a cheap way ?
1. Words
3. W
indow
Counts
2. Window Counts
4. Models
5.Anom
aly
words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
The model algorithm
doesn’t perform well
47. Summary
● Gearpump is good at streaming infinite out-of-order
data and guarantees correctness
● Gearpump helps users to easily program streaming
applications, get runtime information and update
dynamically
50. What is Akka
● Message driven
● Shared nothing
● Location transparent
● Supervision based
failure management
● High performance
Actor
ActorActor
Actor Actor
spawn