Bootstrapping
State in Flink
DataWorks Summit 2018
Gregory Fee
What did the
message queue
say to Flink?
Sad at Work!
DataWorks!
About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model training from every 4 months to every 10 minutes
○ Real-time traffic updates
○ Flyte - Large Scale Orchestration and Batch Compute
○ Lyftlearn - Custom Machine Learning Library
○ Dryft - Real-time Feature Generation for Machine Learning
Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - SPaaS
○ Use Flink as the processing engine
○ Add automation to make it super simple to launch and maintain feature generation programs
at scale
Flink Overview
● Top level Apache project
● High-throughput, low-latency streaming engine
● Event-time processing
● State management
● Fault-tolerance in the event of machine failure
● Support exactly-once semantics
● Used by Alibaba, Netflix, Uber
What is Bootstrapping?
Bootstrapping is not Backfilling
● Using historic data to calculate historic results
● Typical uses:
○ Correct for missing data after a pipeline malfunction
○ Generate output for new business logic
● So what is bootstrapping?
Stateful Stream Programs
counts = stream
.flatMap((x) -> x.split("\\s+"))
.map((x) -> new KV(x, 1))
.keyBy((x) -> x.key)
.window(SlidingEventTimeWindows.of(Time.days(7), Time.hours(1)))
.sum((x) -> x.value);
Counts of the words that appear in the stream over
the last 7 days, updated every hour
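A more complete sketch of that pipeline against the Flink DataStream API. This is illustrative, and assumes stream is a DataStream<String> that already carries event-time timestamps and watermarks:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// Sliding 7-day window that emits updated counts every hour.
DataStream<Tuple2<String, Integer>> counts = stream
    .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
        for (String word : line.split("\\s+")) {
            out.collect(Tuple2.of(word, 1));
        }
    })
    .returns(Types.TUPLE(Types.STRING, Types.INT)) // lambdas lose generic type info
    .keyBy(t -> t.f0)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.hours(1)))
    .sum(1);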
The Waiting is the Hardest Part
A program with a 7-day window needs to process for 7 days before it
has enough data to answer the query correctly.
Day 1: Launch Program → Day 3: Anger → Day 6: Bargaining → Day 8: Relief
What about forever?
Table table = tableEnv.sqlQuery(
    "SELECT user_lyft_id, COUNT(ride_id) " +
    "FROM event_ride_completed " +
    "GROUP BY user_lyft_id");
Counts of the number of rides each user
has ever taken
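For context, a hedged sketch of the setup that makes such a query possible in Flink 1.x; the rideEvents stream and field names are assumptions:

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// Expose the ride-completion stream to SQL; field names are illustrative.
tableEnv.registerDataStream("event_ride_completed", rideEvents,
    "user_lyft_id, ride_id");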
Bootstrapping
Read historic data store to “bootstrap” the program with 7 days
worth of data. Now your program returns results on day 1.
[Timeline: days -7 through -1 are read from the historic store; the program starts on day 1, so results can be validated immediately.]
Provisioning
● We want bootstrapping to be super fast == set
parallelism high
○ Processing a week of data should take less than a week
● We want real-time processing to be super
cheap == set parallelism low
○ Need to host thousands of feature generation programs
Keep in Mind
● Generality is desirable
○ There are potentially simpler ways of
bootstrapping based on your application logic
○ General solution needed to scale to thousands
of programs
● Production Readiness is desirable
○ Observability, scalability, stability, and all those
good things are all considerations
● What works for Lyft might not be right
for you
Use Stream Retention
● Use the retention policy on your stream technology to retain data for as long as you need
○ Kinesis maximum retention is 7 days
○ Kafka has no maximum, but it stores all data on disk: it is not practical at petabyte scale, and spending disk money on infrequently accessed data is suboptimal
● If this is feasible for you, then you should do it
// Start reading from the oldest record still retained in the stream
consumerConfig.put(
    ConsumerConfigConstants.STREAM_INITIAL_POSITION,
    "TRIM_HORIZON");
Kafka “Infinite Retention”
● Alter Kafka to allow for tiered storage
○ Write partitions that age out to secondary storage
○ Push data to S3/Glacier
● Advantages
○ Effectively infinite storage at a reasonable price
○ Use existing Kafka connectors to get data
● Disadvantages
○ Very different performance characteristics of underlying storage
○ No easy way to use different Flink configuration between
bootstrapping and steady state
○ Does not exist today
● Apache Pulsar and Pravega ecosystems might be a
viable alternative
Source Magic
● Write a source that reads from the secondary store until you are within the retention period of your stream
● Transition to reading from the stream
● Advantages
○ Works with any stream provider
● Disadvantages
○ Writing a correct source to bridge between two sources and avoid duplication is hard
○ No easy way to use different Flink configuration between bootstrapping and steady state
[Diagram: a Discovery component enumerates historic S3 files and live Kafka partitions; a Reader consumes both and feeds the Business Logic.]
Application Level Attempt #1
1. Run the bootstrap program
a. Read historic data using a normal source
b. Process the data with selected business logic
c. Wait for all processing to complete
d. Trigger a savepoint and cancel the program
2. Run the steady state program
a. Start the program from the savepoint
b. Read stream data using a normal source
● Advantages
○ No modifications to streams or sources
○ Allows for different Flink configurations between bootstrapping and steady state
● Disadvantages
○ Let’s find out
How Hard Can It Be?
● How do we make sure there is no repeated data?
[Diagram: bootstrap pipeline S3 Source → Business Logic → Sink; steady-state pipeline Kinesis Source → Business Logic → Sink.]
Iteration #2
● How do we trigger a savepoint when bootstrap is complete?
[Diagram: S3 Source → filter(< target time) → Business Logic → Sink; Kinesis Source → filter(>= target time) → Business Logic → Sink.]
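A minimal sketch of that time-based split. Every name here (Event, s3Events, kinesisEvents, the cutover instant) is an illustrative assumption:

import java.time.Instant;
import org.apache.flink.streaming.api.datastream.DataStream;

// Assumed event type; only the event-time field matters for the split.
public class Event {
    public long eventTimeMillis;
}

// Cutover between historic and live data; the exact instant is illustrative.
final long targetTimeMillis = Instant.parse("2018-06-01T00:00:00Z").toEpochMilli();

// The bootstrap program keeps only records strictly before the target time...
DataStream<Event> historic = s3Events.filter(e -> e.eventTimeMillis < targetTimeMillis);

// ...and the steady-state program keeps only records at or after it,
// so no record is ever processed twice across the two runs.
DataStream<Event> live = kinesisEvents.filter(e -> e.eventTimeMillis >= targetTimeMillis);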
Iteration #3
● After the S3 data is read, push a record that is at (target time + 1)
● Termination detector looks for the low watermark to reach (target time + 1)
[Diagram: S3 Source + termination record → filter(< target time) → Business Logic → Termination Detector "Sink"; Kinesis Source → filter(>= target time) → Business Logic → Sink.]
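One way to build the termination detector, sketched with an event-time timer that only fires once the watermark reaches (target time + 1). This reuses the assumed Event type from the earlier sketch, and the class itself is hypothetical:

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical detector; keyed on a constant so event-time timers are available.
public class TerminationDetector extends KeyedProcessFunction<Byte, Event, Void> {
    private final long targetTimeMillis;

    public TerminationDetector(long targetTimeMillis) {
        this.targetTimeMillis = targetTimeMillis;
    }

    @Override
    public void processElement(Event value, Context ctx, Collector<Void> out) {
        // Re-registering the same timestamp is a no-op, so this is cheap.
        ctx.timerService().registerEventTimeTimer(targetTimeMillis + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Void> out) {
        // The low watermark has reached (target time + 1): all S3 data has
        // been processed. Signal external automation to trigger a savepoint
        // and cancel the job, e.g. via a metric or side channel (not shown).
    }
}

Usage would look like businessLogicOutput.keyBy(e -> (byte) 0).process(new TerminationDetector(targetTimeMillis)).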
What Did I Learn?
● Automating Flink from within Flink is
possible but fragile
○ E.g., if you have multiple partitions reading S3, you need to make sure all of them process a message that pushes the watermark to (target time + 1)
● Savepoint state is matched to operators via uid, so make sure uids are set on your business logic
○ There is no support for setting uid on operators generated via SQL
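In the DataStream API that looks like the fragment below; CountRidesFunction and the field names are illustrative assumptions:

// Stable uids let a savepoint taken by the bootstrap job restore cleanly
// into the steady-state job. Without .uid(), Flink auto-generates an id
// that can change whenever the surrounding topology changes.
SingleOutputStreamOperator<Long> counts = events
    .keyBy(e -> e.userId)
    .process(new CountRidesFunction()) // assumed business-logic operator
    .uid("count-rides");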
Application Level Attempt #2
1. Run a highly provisioned job
a. Read from historic data store
b. Read from live stream
c. Union the above
d. Process the data with selected business logic
e. After all S3 data is processed, trigger a savepoint and cancel the program
2. Run a lightly provisioned job
a. Exact same ‘shape’ of program as above, but with less parallelism
b. Restore from savepoint
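A sketch of that single-topology shape, reusing the assumed names from the earlier split example:

// Both phases run this same topology; only parallelism differs, so the
// savepoint taken at the end of the bootstrap run restores cleanly.
DataStream<Event> historic = s3Events.filter(e -> e.eventTimeMillis < targetTimeMillis);
DataStream<Event> live = kinesisEvents.filter(e -> e.eventTimeMillis >= targetTimeMillis);

// Union interleaves both inputs; event-time logic holds live results back
// until the historic watermark catches up.
DataStream<Event> all = historic.union(live);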
Success?
● Advantages
○ Less fragile, works with SQL
● Disadvantages
○ Uses many resources or requires external automation
○ Live data is buffered until historic data completes
[Diagram: S3 Source (< target time) and Kinesis Source (>= target time) union into Business Logic → Sink.]
Is it live?
● Running in Production at Lyft now
● Actively adding more feature generation programs
How Could We Make This Better?
● Kafka Infinite Retention
○ Repartition still necessary to get optimal bootstrap performance
● Programs as Sources
○ Allow sources to be built in a high-level programming model, e.g. Beam's Splittable DoFn
● Dynamic Repartitioning + Adaptive Resource
Management
○ Allow Flink parallelism to change without canceling the program
○ Allow Flink checkpointing policy to change without canceling the
program
● Meta-messages
○ Allow the passing of metadata within the data stream; watermarks are one type of metadata
What about Batch Mode?
● Batch Mode can be more efficient than Streaming
Mode
○ Offline data has different properties than stream data
● Method #1
○ Use batch mode to process historic data, make a savepoint at
the end
○ Start a streaming mode program from the savepoint, process
stream data
● Method #2
○ Modify Flink to understand a transition watermark so the batch runtime automatically transitions to the streaming runtime; requires a unified source
What Did We Learn?
● Many stream programs are stateful
● Faster-than-real-time bootstrapping using Flink is possible
● There are many opportunities for
improvement
Q&A

Editor's Notes

  • On "What is Bootstrapping?": "A technique of loading a program into a computer by means of a few initial instructions that enable the introduction of the rest of the program from an input device."
  • On "Source Magic": Even this system has issues that we'll talk more about later.