Migrating batch ETLs to streaming Flink
by William Saar, Updab
About the speaker
Flink, Kafka, Cassandra, Druid, Kubernetes, Java, Scala, Rust
● Updab - Independent Consulting
● tCell.io (acquired by Rapid7) - Consulting, Remote to SF
● King
● Cinnober Financial Technology (now Nasdaq)
● BEA Systems (now Oracle) - Java Mission Control (open-source!) developer
● Essnet (now Scientific Games)
● Digital Route - Telecom
Evolution and Trends
System Architecture: Yesterday
© 2019 Updab AB
System Architecture: Mainstream
System Architecture: Emerging
System Architecture: Future?
Queries
Data pipeline evolution
● Nightly Reports
● Streaming Applications
● “Real-time” analytics
Batch vs Streaming
Pipeline types
● Well-defined batch
○ Nightly reports, ML parameter computations, Data cleaning
● “Wannabe-streaming” batch
○ Charts updating every minute, Alerting
● Streaming
○ Computation for every input (every event for Flink and Kafka Streams; every micro-batch for Spark updateStateByKey)
Benefits of streaming
● Faster results
● Incremental computation -> Fewer resources -> Simpler architecture
● Flexible deployment (Flink and Kafka Streams)
○ Lyft’s Kubernetes operator for Flink https://github.com/lyft/flinkk8soperator
● Always up-to-date queryable state (Flink and Kafka Streams)
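The "incremental computation" point can be illustrated with a running average. This is a minimal plain-Java sketch, not Flink or Kafka Streams code; the class and method names are hypothetical. The incremental path keeps two numbers as state and does constant work per event, while the batch path has to re-read the whole data set on every run.

```java
import java.util.List;

/** Hypothetical illustration; not part of Flink or Kafka Streams. */
final class RunningAverage {
    private long count;
    private double sum;

    /** Incremental path: constant work per event, two numbers of state. */
    void add(double value) {
        count++;
        sum += value;
    }

    /** Current result, available after every event. */
    double average() {
        return count == 0 ? 0.0 : sum / count;
    }

    /** Batch path: re-reads the full data set each time it runs. */
    static double batchAverage(List<Double> all) {
        double total = 0.0;
        for (double v : all) {
            total += v;
        }
        return all.isEmpty() ? 0.0 : total / all.size();
    }
}
```

Both paths give the same answer; the difference is how much data has to be touched to produce it.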
Apache Flink
● Widespread adoption and input source compatibility
○ Used for AWS Kinesis Data Analytics
● Rigorous time models
● Flexible state storage and control of intermediate states
Batch architecture
Streaming Flink Architecture
Translating Batches
Flink Time Windows
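A periodic batch maps naturally onto a tumbling time window: each event is assigned to a fixed-size bucket by its timestamp. The sketch below shows that assignment in plain Java, without Flink on the classpath. The class and method names are hypothetical, and it assumes non-negative epoch-millisecond timestamps and ignores watermarks and late events. In a real Flink job a window assigner such as `TumblingEventTimeWindows` would do this work.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of tumbling-window assignment; not Flink API. */
final class TumblingWindowSketch {

    /** Start of the fixed-size window containing the timestamp (non-negative millis assumed). */
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    /** Counts events per window, keyed by window start; the streaming analogue of "group rows by period". */
    static Map<Long, Long> countPerWindow(long[] eventTimestamps, long windowSizeMillis) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : eventTimestamps) {
            counts.merge(windowStart(ts, windowSizeMillis), 1L, Long::sum);
        }
        return counts;
    }
}
```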
Other Window Operators
● Count windows
● Session windows
● Custom windows/Process functions/Co-process functions
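Session windows group events by activity rather than by clock boundaries. As a rough plain-Java sketch (hypothetical names, not Flink API), events are sorted by timestamp and a new session starts whenever the gap to the previous event exceeds a configured value. How a gap exactly equal to the limit is treated is an assumption of this sketch and may differ from Flink's own session window merging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of gap-based session windows; not Flink API. */
final class SessionWindowSketch {

    /** Returns one {firstTimestamp, lastTimestamp} pair per session. */
    static List<long[]> sessions(long[] timestamps, long gapMillis) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        if (sorted.length == 0) {
            return result;
        }
        long start = sorted[0];
        long last = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            // A gap larger than gapMillis closes the current session (assumption: strict ">").
            if (sorted[i] - last > gapMillis) {
                result.add(new long[] {start, last});
                start = sorted[i];
            }
            last = sorted[i];
        }
        result.add(new long[] {start, last});
        return result;
    }
}
```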
Testing
Testing
● End-to-end tests: Standalone Flink job with source and outputs replaced
○ Docker containers with Kafka, Postgres
● Structure code to support function or stream-segment testing
○ DataStream<Output> out = process(DataStream<Input> in)
○ FlinkSpector tool https://github.com/ottogroup/flink-spector
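The stream-in, stream-out shape suggested above can be illustrated without Flink by using `java.util.stream.Stream` as a stand-in for `DataStream`. This is only an analogy under that assumption; `PipelineSegment`, `process`, and `runOn` are hypothetical names. The point is that the segment knows nothing about its source or sink, so a test can feed it an in-memory collection.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical pipeline segment; java.util.stream.Stream stands in for Flink's DataStream. */
final class PipelineSegment {

    /** Stream-in, stream-out: no knowledge of Kafka, Postgres, or any other source or sink. */
    static Stream<String> process(Stream<String> in) {
        return in.map(String::trim)
                 .filter(line -> !line.isEmpty())
                 .map(String::toUpperCase);
    }

    /** Test helper: feeds an in-memory collection and collects the output. */
    static List<String> runOn(List<String> input) {
        return process(input.stream()).collect(Collectors.toList());
    }
}
```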
Global Computations
Join keyed and broadcast state
Global Computations
● KeyedBroadcastProcessFunction
● Probabilistic data structures: t-Digest, HyperLogLog
● External stream topic
● External service - AsyncFunction
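The join of keyed and broadcast state can be sketched as a single object with two inputs. This is a plain-Java simulation with hypothetical names, not the Flink API. In Flink, a `KeyedBroadcastProcessFunction` receives the two inputs through `processElement` and `processBroadcastElement`, and the state would live in Flink-managed keyed and broadcast state rather than in fields. The example use case, a global alert threshold, is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of joining per-key state with a globally broadcast value. */
final class BroadcastJoinSketch {
    private final Map<String, Long> countsByKey = new HashMap<>(); // stands in for keyed state
    private long threshold = Long.MAX_VALUE;                        // stands in for broadcast state
    private final List<String> alerts = new ArrayList<>();

    /** Broadcast side: every parallel instance would receive the same update. */
    void onBroadcast(long newThreshold) {
        threshold = newThreshold;
    }

    /** Keyed side: updates this key's count and compares it against the global value. */
    void onEvent(String key) {
        long count = countsByKey.merge(key, 1L, Long::sum);
        if (count > threshold) {
            alerts.add(key + ":" + count);
        }
    }

    List<String> alerts() {
        return alerts;
    }
}
```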
Replays
Replays: Why?
● Corrupt data or changing data sets
● Bugs in pipeline logic
● Test different pipeline logic
Replays
● The Good: Single checkpoint synchronizes sources with intermediate states
● Challenges: External systems outside Flink’s control
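Why a single checkpoint makes replays tractable can be shown with a toy model: the snapshot pairs the source position with the operator state at that position, so resuming from it gives the same result as an uninterrupted run. This is a plain-Java sketch with hypothetical names, where an array stands in for a replayable log such as a Kafka topic. It does not model external systems, which is exactly where the challenges arise.

```java
/** Hypothetical toy model of checkpoint-and-replay; not Flink API. */
final class CheckpointSketch {

    /** A snapshot that pairs the source position with the operator state at that position. */
    static final class Checkpoint {
        final int offset;
        final long sum;

        Checkpoint(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    /** Processes log[from.offset .. upTo) on top of the state stored in the checkpoint. */
    static Checkpoint process(long[] log, Checkpoint from, int upTo) {
        long sum = from.sum;
        for (int i = from.offset; i < upTo; i++) {
            sum += log[i];
        }
        return new Checkpoint(upTo, sum);
    }
}
```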
Replays: Techniques and helpful practices
● Idempotent writes
● Regular, predictable writes may allow overwriting
● Move external state into Flink - queryable state
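One way to read "idempotent writes" together with "regular, predictable writes may allow overwriting" is an upsert keyed by something stable, such as the entity key plus the window start. A replay then overwrites earlier rows instead of appending duplicates. The sketch below is plain Java with hypothetical names; a `HashMap` stands in for the external store, and the choice of composite key is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sink with upsert semantics; a HashMap stands in for an external store. */
final class IdempotentSink {
    private final Map<String, Long> table = new HashMap<>();

    /** Upsert keyed by (key, window start): writing the same result twice leaves one row. */
    void write(String key, long windowStart, long value) {
        table.put(key + "@" + windowStart, value);
    }

    /** Returns the stored value, or null if nothing was written for this key and window. */
    Long read(String key, long windowStart) {
        return table.get(key + "@" + windowStart);
    }

    int rowCount() {
        return table.size();
    }
}
```

A replay that recomputes a corrected value for the same key and window simply replaces the old row.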
Thanks!
updab.com
william@updab.com
@saarw
