Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Gearpump next-gen streaming engine

2,413 views

Published on

Our talk on Apache Gearpump at Strata+Hadoop World Beijing 2016.

Published in: Software

Apache Gearpump next-gen streaming engine

  1. 1. Apache Gearpump next-gen streaming engine Manu Zhang, mauzhang@apache.org Sean Zhong, sean-zhong@apache.org
  2. 2. What is Gearpump ? ● An Akka[2] based real-time streaming engine ● An Apache Incubator[1] project since Mar.8th, 2016 ● A super simple pump that consists of only two gears but very powerful at streaming water
  3. 3. Gearpump Architecture ● Actor concurrency ● Message passing communication ● error handling and isolation with supervision hierarchy ● Master HA with Akka Cluster
  4. 4. Why Gearpump ?
  5. 5. Stream processing is hard but no system meets all requirements ● Infinite data ● Out-of-order data ● Low latency assurance (e.g real-time recommendation) ● Correctness requirement (e.g. charge advertisers for ads) ● Cheap to update user logics
  6. 6. How Gearpump solves the hard parts ● User interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG
  7. 7. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
  8. 8. User Interface - DAG API Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~> HashPartitioner ~> AnomalyDetector, AnomalyDetector ~> ShufflePartitioner ~> KafkaSink)
  9. 9. partitioner User Interface - Partitioner Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~ HashPartitioner ~> AnomalyDetector, AnomalyDetector ~ ShufflePartitioner ~> KafkaSink) WindowCounter (parallelism = 3) ModelCalculator (parallelism = 2) Task Task Task Task Task
  10. 10. User Interface - Compose DAG
  11. 11. How Gearpump solves the hard parts ● User interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG
  12. 12. Flow Control 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink
  13. 13. Without Flow Control - messages queue up KafkaSource WindowCounter ModelCalculator fast fast very slow
  14. 14. Without Flow Control - OOM KafkaSource WindowCounter ModelCalculator fast fast very slow
  15. 15. With Flow Control - Backpressure KafkaSource WindowCounter ModelCalculator fast slow down very slow backpressure
  16. 16. With Flow Control - Backpressure KafkaSource WindowCounter ModelCalculator Slow down Slow down Very Slow backpressurebackpressure pull slower
  17. 17. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
  18. 18. Out-of-order data Event time 321 Processingtime ● Event time - when data generated ● Processing time - when data processed 1 2 3 4 5 6
  19. 19. Window Count 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink
  20. 20. When can window counts be emitted ? window messages count [0, 2) 1 [2, 4) 2 [4, 6) 2 (“gearpump”, 1) WindowCounter In-memory Table ● No window can be emitted since message as early as time 1 has not arrived (“gearpump”, 1) (“gearpump”, 3) (“gearpump”, 5) (“gearpump”, 4) (“gearpump”, 2) WindowCounter
  21. 21. On Watermark[4] Event time 321 Processingtime 1 2 3 4 5 6 watermark ● No timestamp earlier than watermark will be seen
  22. 22. Out-of-order processing with watermark window messages count [0, 2) 1 [2, 4) 2 [4, 6) 2 (“gearpump”, 1) WindowCounter In-memory Table (“gearpump”, 1) (“gearpump”, 3) (“gearpump”, 5) (“gearpump”, 4) (“gearpump”, 2) WindowCounter watermark = 0, No window can be emitted
  23. 23. Out-of-order processing with watermark window messages count [0, 2) 1 [2, 4) 2 [4, 6) 2 (“gearpum p”, [0, 2), 1) WindowCounter In-memory Table (“gearpump”, 1) (“gearpump”, 3) (“gearpump”, 5) (“gearpump”, 4) (“gearpump”, 2) WindowCounter watermark = 2, Window [0, 2) can be emitted (“gearpump”, [0, 2), 1)
  24. 24. How to get watermark ? 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink
  25. 25. From upstream 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink watermark
  26. 26. More on Watermark ● Source watermark defined by user ● Usually heuristic based ● Users decide whether to drop data arriving after watermark
  27. 27. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG
  28. 28. Exactly Once - no Lost or Duplicated States ● Lost states can cause anomaly words skipped ● Duplicated states can cause false alarm States 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink kafka_offset window_counts models
  29. 29. Exactly Once with asynchronous checkpointing (2, kafka_offset) KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 0 Watermark = 0 (2, kafka_offset)
  30. 30. (2, kafka_offset) (2, window_counts) Exactly Once with asynchronous checkpointing KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 0 (2, window_counts)
  31. 31. (2, kafka_offset) (2, window_counts) (2, models) Exactly Once with asynchronous checkpointing KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 2 (2, models)
  32. 32. (2, kafka_offset) (2, window_counts) (2, models) Exactly Once with asynchronous checkpointing KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 2 (2, models)
  33. 33. (2, kafka_offset) (2, window_counts) (2, models) Crash KafkaSource WindowCounter ModelCalculator Watermark = 2 Watermark = 2 Watermark = 2 (2, models)
  34. 34. (2, kafka_offset) (2, window_counts) (2, models) Recover to latest checkpoint at 2 KafkaSource WindowCounter ModelCalculator modelswindow_countskafka_offset Replay from kafka Get state at 2
  35. 35. More on Exactly Once ● User extends PersistentState interface and system checkpoints state for users automatically ● Watermark is persisted on master
  36. 36. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
  37. 37. Can we update model algorithm in a cheap way ? 1. Words 3. W indow Counts 2. Window Counts 4. Models 5.Anom aly words KafkaSource WindowCounter ModelCalculator AnomalyDetector KafkaSink The model algorithm doesn’t perform well
  38. 38. Dynamic DAG - modify application at runtime
  39. 39. Performance
  40. 40. Yahoo! Streaming Benchmark[5]
  41. 41. Yahoo! Streaming Benchmark ● 4 node cluster ● Each node has 32-core, 64 GB memory ● 10GB Ethernet
  42. 42. Advanced features
  43. 43. DAG Visualization ● Watermark ● Node size reflects throughput ● Edge width represents flow rate
  44. 44. DAG Visualization ● Data skew analysis
  45. 45. Apache Beam[6] Gearpump Runner Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution Apache Gearpump
  46. 46. Experimental features ● Web UI Authorization / OAuth2 Authentication ● CGroup Resource Isolation ● Binary Storm compatibility ● Akka Streams integration
  47. 47. Summary ● Gearpump is good at streaming infinite out-of-order data and guarantees correctness ● Gearpump helps users to easily program streaming applications, get runtime information and update dynamically
  48. 48. References 1. gearpump.apache.org 2. akka.io 3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-ti me-dagprocessing-with-akka-at-scale 4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akid au.pdf 5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streami ng-computation-engines-at 6. Apache Beam [project overview]
  49. 49. Backup Slides
  50. 50. What is Akka ● Message driven ● Shared nothing ● Location transparent ● Supervision based failure management ● High performance Actor ActorActor Actor Actor spawn

×