
Migrating batch ETLs to streaming Flink

Data engineering presentation about the evolution of data infrastructure and migrating batch analytics jobs to streaming pipelines with Apache Flink.

Published in: Data & Analytics


  1. Migrating batch ETLs to streaming Flink, by William Saar, Updab
  2. About the speaker - Flink, Kafka, Cassandra, Druid, Kubernetes, Java, Scala, Rust
     ● Updab - independent consulting
     ● tCell.io (acquired by Rapid7) - consulting, remote to SF
     ● King
     ● Cinnober Financial Technology (now Nasdaq)
     ● BEA Systems (now Oracle) - Java Mission Control (open-source!) developer
     ● Essnet (now Scientific Games)
     ● Digital Route - telecom
  3. Evolution and Trends
  4. System Architecture: Yesterday (diagram) © 2019 Updab AB
  5. System Architecture: Mainstream (diagram)
  6. System Architecture: Emerging (diagram)
  7. System Architecture: Future? (diagram: queries)
  8. Data pipeline evolution (diagram: Nightly Reports, “Real-time” analytics, Streaming Applications)
  9. Batch vs Streaming
  10. Pipeline types
      ● Well-defined batch
        ○ Nightly reports, ML parameter computations, data cleaning
      ● “Wannabe-streaming” batch
        ○ Charts updating every minute, alerting
      ● Streaming
        ○ Computation for every input (every event in Flink/Kafka Streams; Spark updateStateByKey)
  11. Benefits of streaming
      ● Faster results
      ● Incremental computation -> fewer resources -> simpler architecture
      ● Flexible deployment (Flink and Kafka Streams)
        ○ Lyft’s Kubernetes operator for Flink: https://github.com/lyft/flinkk8soperator
      ● Always up-to-date queryable state (Flink and Kafka Streams)
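The resource benefit of incremental computation can be illustrated outside Flink. Below is a minimal, framework-free Python sketch (the names `batch_sum` and `RunningSum` are illustrative, not Flink API): a batch job rescans the full data set on every run, while a streaming job keeps a small piece of state and does constant work per event.

```python
def batch_sum(all_events):
    # Batch: rescan the entire data set on every scheduled run.
    return sum(all_events)

class RunningSum:
    """Streaming: keep small state, update incrementally per event."""
    def __init__(self):
        self.total = 0

    def on_event(self, value):
        self.total += value  # O(1) work and O(1) state per event
        return self.total

events = [3, 1, 4, 1, 5]
agg = RunningSum()
streaming_results = [agg.on_event(v) for v in events]
# The final incremental result matches a full batch recomputation.
assert streaming_results[-1] == batch_sum(events)
```

The streaming side also has an always-current answer after every event, which is what makes queryable state possible.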
  12. Apache Flink
      ● Widespread adoption and input-source compatibility
        ○ Used for AWS Kinesis Data Analytics
      ● Rigorous time models
      ● Flexible state storage and control of intermediate states
  13. Batch architecture (diagram)
  14. Streaming Flink architecture (diagram)
  15. Translating Batches
  16. Flink Time Windows (diagram)
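A nightly batch report typically translates to a tumbling event-time window (in Flink, `TumblingEventTimeWindows`). The bucketing arithmetic behind tumbling windows can be sketched in plain Python; the window size and function names here are illustrative, not Flink's API:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # illustrative: 1-minute tumbling windows

def window_start(event_time_ms):
    # A tumbling window is identified by its aligned start timestamp.
    return event_time_ms - (event_time_ms % WINDOW_MS)

def tumble_count(events):
    """Count events per tumbling window; events are (timestamp_ms, payload)."""
    counts = defaultdict(int)
    for ts, _payload in events:
        counts[window_start(ts)] += 1
    return dict(counts)

events = [(0, "a"), (59_999, "b"), (60_000, "c"), (125_000, "d")]
# tumble_count(events) -> {0: 2, 60000: 1, 120000: 1}
```

In real Flink the framework also tracks watermarks to decide when an event-time window is complete and can be emitted.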
  17. Other Window Operators
      ● Count windows
      ● Session windows
      ● Custom windows / process functions / co-process functions
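Session windows differ from tumbling windows in that they have no fixed size: a window stays open until a gap of inactivity exceeds a configured timeout. A minimal sketch of that grouping rule (the gap value and `sessionize` name are illustrative, not Flink's API):

```python
SESSION_GAP_MS = 30_000  # illustrative inactivity gap

def sessionize(timestamps):
    """Group sorted event times into sessions separated by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP_MS:
            sessions[-1].append(ts)   # within the gap: extend current session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

# sessionize([0, 10_000, 50_000]) -> [[0, 10000], [50000]]
```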
  18. Testing
  19. Testing
      ● End-to-end tests: standalone Flink job with sources and outputs replaced
        ○ Docker containers with Kafka, Postgres
      ● Structure code to support function or stream-segment testing
        ○ DataStream<Output> out = process(DataStream<Input> in)
        ○ FlinkSpector tool: https://github.com/ottogroup/flink-spector
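The `DataStream<Output> out = process(DataStream<Input> in)` structure keeps the pipeline body separate from sources and sinks, so the same logic runs against Kafka in production and plain collections in tests. A Python sketch of the idea (the `process` body here is a made-up example transformation):

```python
def process(events):
    """The pipeline body as a pure function over its input stream.
    In production this is wired to real sources/sinks; in tests it is
    fed plain lists, so no cluster or brokers are needed."""
    return [e["value"] * 2 for e in events if e["value"] > 0]

# A unit test feeds lists instead of Kafka topics:
assert process([{"value": 1}, {"value": -5}, {"value": 3}]) == [2, 6]
```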
  20. Global Computations
  21. Join keyed and broadcast state (diagram)
  22. Global Computations
      ● KeyedBroadcastProcessFunction
      ● Probabilistic data structures: t-Digest, HyperLogLog
      ● External stream topic
      ● External service - AsyncFunction
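The `KeyedBroadcastProcessFunction` pattern combines two kinds of state: broadcast state, replicated to every parallel instance (e.g. a global rule set), and keyed state, partitioned per key. A framework-free sketch of the pattern (the class and thresholds are illustrative, not Flink's API):

```python
class KeyedBroadcastJoin:
    """Sketch of joining keyed and broadcast state: broadcast elements
    update shared rules on every instance, keyed events update only
    their own key's state and are checked against all rules."""
    def __init__(self):
        self.broadcast_state = {}  # rule_id -> threshold, same everywhere
        self.keyed_state = {}      # key -> running total, partitioned

    def process_broadcast(self, rule_id, threshold):
        self.broadcast_state[rule_id] = threshold

    def process_keyed(self, key, value):
        self.keyed_state[key] = self.keyed_state.get(key, 0) + value
        # Emit every rule this key's running total now exceeds.
        return [rule_id for rule_id, t in self.broadcast_state.items()
                if self.keyed_state[key] > t]

j = KeyedBroadcastJoin()
j.process_broadcast("r1", 10)
j.process_keyed("user-a", 7)   # -> [] (total 7, below threshold)
j.process_keyed("user-a", 5)   # -> ["r1"] (total 12, exceeds 10)
```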
  23. Replays
  24. Replays: why?
      ● Corrupt data or changing data sets
      ● Bugs in pipeline logic
      ● Testing different pipeline logic
  25. Replays
      ● The good: a single checkpoint synchronizes sources with intermediate states
      ● Challenges: external systems outside Flink’s control
  26. Replays: techniques and helpful practices
      ● Idempotent writes
      ● Regular, predictable writes may allow overwriting
      ● Move external state into Flink - queryable state
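Idempotent writes make external sinks safe to replay: keying each write by a deterministic id means a replayed stream rewrites the same rows instead of appending duplicates. A minimal sketch with a dict standing in for the external store (the key scheme is a made-up example):

```python
def idempotent_write(store, record):
    """Upsert keyed by a deterministic id derived from the data, so
    replaying the same events produces the same store contents."""
    store[record["id"]] = record["payload"]

store = {}
records = [
    {"id": "2019-06-01/acct-1", "payload": 10},
    {"id": "2019-06-01/acct-1", "payload": 10},  # duplicate from a replay
]
for rec in records:
    idempotent_write(store, rec)
assert len(store) == 1  # the replayed write overwrote, not appended
```

In practice this maps to upserts in Cassandra or `INSERT ... ON CONFLICT` in Postgres, with the id derived from window boundaries and keys rather than generated at write time.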
  27. Thanks! updab.com william@updab.com @saarw
