Bootstrapping
State in Flink
DataWorks Summit 2018
Gregory Fee
What did the
message queue
say to Flink?
Sad at Work!
DataWorks!
About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model training from every 4 months to every 10 minutes
○ Real-time traffic updates
○ Flyte - Large Scale Orchestration and Batch Compute
○ Lyftlearn - Custom Machine Learning Library
○ Dryft - Real-time Feature Generation for Machine Learning
Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - SPaaS
○ Use Flink as the processing engine
○ Add automation to make it super simple to launch and maintain feature generation programs
at scale
Flink Overview
● Top level Apache project
● High-throughput, low-latency streaming engine
● Event-time processing
● State management
● Fault-tolerance in the event of machine failure
● Support exactly-once semantics
● Used by Alibaba, Netflix, Uber
What is Bootstrapping?
Bootstrapping is not Backfilling
● Using historic data to calculate historic results
● Typical uses:
○ Correct for missing data after a pipeline malfunction
○ Generate output for new business logic
● So what is bootstrapping?
Stateful Stream Programs
counts = stream
.flatMap((x) -> x.split("\\s+"))
.map((x) -> new KV(x, 1))
.keyBy((x) -> x.key)
.window(SlidingEventTimeWindows.of(Time.days(7), Time.hours(1)))
.sum((x) -> x.value);
Counts of the words that appear in the stream over
the last 7 days, updated every hour
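A more complete sketch of that pipeline against the Flink DataStream API. This is illustrative, and assumes stream is a DataStream<String> that already carries event-time timestamps and watermarks:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// Sliding 7-day window that emits updated counts every hour.
DataStream<Tuple2<String, Integer>> counts = stream
    .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
        for (String word : line.split("\\s+")) {
            out.collect(Tuple2.of(word, 1));
        }
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas lose generic type info
    .keyBy(t -> t.f0)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.hours(1)))
    .sum(1);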
The Waiting is the Hardest Part
A program with a 7-day window needs to process for 7 days before it
has enough data to answer the query correctly.
Day 1: Launch Program → Day 3: Anger → Day 6: Bargaining → Day 8: Relief
What about forever?
Table table = tableEnv.sqlQuery(
    "SELECT user_lyft_id, COUNT(ride_id) " +
    "FROM event_ride_completed " +
    "GROUP BY user_lyft_id");
Counts of the number of rides each user
has ever taken
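For context, a hedged sketch of the setup that makes such a query possible in Flink 1.x; the rideEvents stream and field names are assumptions:

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// Expose the ride-completion stream to SQL; field names are illustrative.
tableEnv.registerDataStream("event_ride_completed", rideEvents,
    "user_lyft_id, ride_id");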
Bootstrapping
Read historic data store to “bootstrap” the program with 7 days
worth of data. Now your program returns results on day 1.
[Timeline: days -7 through -1 are read from the historic store; the program starts on day 1, so results can be validated immediately.]
Provisioning
● We want bootstrapping to be super fast == set
parallelism high
○ Processing a week of data should take less than a week
● We want real-time processing to be super
cheap == set parallelism low
○ Need to host thousands of feature generation programs
Keep in Mind
● Generality is desirable
○ There are potentially simpler ways of
bootstrapping based on your application logic
○ General solution needed to scale to thousands
of programs
● Production Readiness is desirable
○ Observability, scalability, stability, and all those
good things are all considerations
● What works for Lyft might not be right
for you
Use Stream Retention
● Use the retention policy on your stream technology to retain data for as long as you need
○ Kinesis maximum retention is 7 days
○ Kafka has no maximum, but it stores all data on disk: it is not practical at petabyte scale, and spending disk money on infrequently accessed data is suboptimal
● If this is feasible for you, then you should do it
// Start reading from the oldest record still retained in the stream
consumerConfig.put(
    ConsumerConfigConstants.STREAM_INITIAL_POSITION,
    "TRIM_HORIZON");
Kafka “Infinite Retention”
● Alter Kafka to allow for tiered storage
○ Write partitions that age out to secondary storage
○ Push data to S3/Glacier
● Advantages
○ Effectively infinite storage at a reasonable price
○ Use existing Kafka connectors to get data
● Disadvantages
○ Very different performance characteristics of underlying storage
○ No easy way to use different Flink configuration between
bootstrapping and steady state
○ Does not exist today
● Apache Pulsar and Pravega ecosystems might be a
viable alternative
Source Magic
● Write a source that reads from the secondary store until you are within the retention period of your stream
● Transition to reading from the stream
● Advantages
○ Works with any stream provider
● Disadvantages
○ Writing a correct source to bridge between two sources and avoid duplication is hard
○ No easy way to use different Flink configuration between bootstrapping and steady state
[Diagram: a Discovery component enumerates historic S3 files and live Kafka partitions; a Reader consumes both and feeds the Business Logic.]
Application Level Attempt #1
1. Run the bootstrap program
a. Read historic data using a normal source
b. Process the data with selected business logic
c. Wait for all processing to complete
d. Trigger a savepoint and cancel the program
2. Run the steady state program
a. Start the program from the savepoint
b. Read stream data using a normal source
● Advantages
○ No modifications to streams or sources
○ Allows for different Flink configurations between bootstrapping and steady state
● Disadvantages
○ Let’s find out
How Hard Can It Be?
● How do we make sure there is no repeated data?
[Diagram: bootstrap pipeline S3 Source → Business Logic → Sink; steady-state pipeline Kinesis Source → Business Logic → Sink.]
Iteration #2
● How do we trigger a savepoint when bootstrap is complete?
[Diagram: S3 Source → filter(< target time) → Business Logic → Sink; Kinesis Source → filter(>= target time) → Business Logic → Sink.]
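A minimal sketch of that time-based split. Every name here (Event, s3Events, kinesisEvents, the cutover instant) is an illustrative assumption:

import java.time.Instant;
import org.apache.flink.streaming.api.datastream.DataStream;

// Assumed event type; only the event-time field matters for the split.
public class Event {
    public long eventTimeMillis;
}

// Cutover between historic and live data; the exact instant is illustrative.
final long targetTimeMillis = Instant.parse("2018-06-01T00:00:00Z").toEpochMilli();

// The bootstrap program keeps only records strictly before the target time...
DataStream<Event> historic = s3Events.filter(e -> e.eventTimeMillis < targetTimeMillis);

// ...and the steady-state program keeps only records at or after it,
// so no record is ever processed twice across the two runs.
DataStream<Event> live = kinesisEvents.filter(e -> e.eventTimeMillis >= targetTimeMillis);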
Iteration #3
● After the S3 data is read, push a record that is at (target time + 1)
● Termination detector looks for the low watermark to reach (target time + 1)
[Diagram: S3 Source + termination record → filter(< target time) → Business Logic → Termination Detector "Sink"; Kinesis Source → filter(>= target time) → Business Logic → Sink.]
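One way to build the termination detector, sketched with an event-time timer that only fires once the watermark reaches (target time + 1). This reuses the assumed Event type from the earlier sketch, and the class itself is hypothetical:

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical detector; keyed on a constant so event-time timers are available.
public class TerminationDetector extends KeyedProcessFunction<Byte, Event, Void> {
    private final long targetTimeMillis;

    public TerminationDetector(long targetTimeMillis) {
        this.targetTimeMillis = targetTimeMillis;
    }

    @Override
    public void processElement(Event value, Context ctx, Collector<Void> out) {
        // Re-registering the same timestamp is a no-op, so this is cheap.
        ctx.timerService().registerEventTimeTimer(targetTimeMillis + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Void> out) {
        // The low watermark has reached (target time + 1): all S3 data has
        // been processed. Signal external automation to trigger a savepoint
        // and cancel the job, e.g. via a metric or side channel (not shown).
    }
}

Usage would look like businessLogicOutput.keyBy(e -> (byte) 0).process(new TerminationDetector(targetTimeMillis)).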
What Did I Learn?
● Automating Flink from within Flink is
possible but fragile
○ E.g., if you have multiple partitions reading S3, you need to make sure all of them process a message that pushes the watermark to (target time + 1)
● Savepoint state is matched to operators via uid, so make sure uids are set on your business logic
○ There is no support for setting uid on operators generated via SQL
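In the DataStream API that looks like the fragment below; CountRidesFunction and the field names are illustrative assumptions:

// Stable uids let a savepoint taken by the bootstrap job restore cleanly
// into the steady-state job. Without .uid(), Flink auto-generates an id
// that can change whenever the surrounding topology changes.
SingleOutputStreamOperator<Long> counts = events
    .keyBy(e -> e.userId)
    .process(new CountRidesFunction()) // assumed business-logic operator
    .uid("count-rides");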
Application Level Attempt #2
1. Run a highly provisioned job
a. Read from historic data store
b. Read from live stream
c. Union the above
d. Process the data with selected business logic
e. After all S3 data is processed, trigger a savepoint and cancel the program
2. Run a lightly provisioned job
a. Exact same ‘shape’ of program as above, but with less parallelism
b. Restore from savepoint
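A sketch of that single-topology shape, reusing the assumed names from the earlier split example:

// Both phases run this same topology; only parallelism differs, so the
// savepoint taken at the end of the bootstrap run restores cleanly.
DataStream<Event> historic = s3Events.filter(e -> e.eventTimeMillis < targetTimeMillis);
DataStream<Event> live = kinesisEvents.filter(e -> e.eventTimeMillis >= targetTimeMillis);

// Union interleaves both inputs; event-time logic holds live results back
// until the historic watermark catches up.
DataStream<Event> all = historic.union(live);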
Success?
● Advantages
○ Less fragile, works with SQL
● Disadvantages
○ Uses many resources or requires external automation
○ Live data is buffered until historic data completes
[Diagram: S3 Source (< target time) and Kinesis Source (>= target time) union into Business Logic → Sink.]
Is it live?
● Running in Production at Lyft now
● Actively adding more feature generation programs
How Could We Make This Better?
● Kafka Infinite Retention
○ Repartition still necessary to get optimal bootstrap performance
● Programs as Sources
○ Allow sources to be built in a high-level programming model, e.g. Beam's Splittable DoFn
● Dynamic Repartitioning + Adaptive Resource
Management
○ Allow Flink parallelism to change without canceling the program
○ Allow Flink checkpointing policy to change without canceling the
program
● Meta-messages
○ Allow the passing of metadata within the data stream; watermarks are one type of metadata
What about Batch Mode?
● Batch Mode can be more efficient than Streaming
Mode
○ Offline data has different properties than stream data
● Method #1
○ Use batch mode to process historic data, make a savepoint at
the end
○ Start a streaming mode program from the savepoint, process
stream data
● Method #2
○ Modify Flink to understand a transition watermark so the batch runtime automatically transitions to the streaming runtime; requires a unified source
What Did We Learn?
● Many stream programs are stateful
● Faster-than-real-time bootstrapping using Flink is possible
● There are many opportunities for
improvement
Q&A

Editor's Notes

  • On "What is Bootstrapping?": "A technique of loading a program into a computer by means of a few initial instructions that enable the introduction of the rest of the program from an input device."
  • On "Source Magic": Even this system has issues that we'll talk more about later.