Running Flink in Production: The Good, The Bad and The In Between
Oct 8, 2019
Lakshmi Rao | @glaksh100
What’s Lyft?
Dynamic Pricing: supply/demand curve, pricing
Notifications
ETA Prediction
Coupons
User Experience: top destinations, core experience
Fraud: behaviour fingerprinting, monetary impact, imperative to act fast
Streaming Platform @ Lyft
Data Platform @ Lyft
[Diagram: events from backend services and the mobile app flow through streaming sources and out to streaming sinks]
Users @ Lyft
Use cases:
❖ Data pipelines
❖ Marketplace: speed observations, locations data
❖ Building ML models
❖ Pricing platform
Users:
❖ Data engineers
❖ Research scientists
❖ Product managers
❖ Business operations
Teams: Analytics, ML Engineers, Engineers
Abstraction levels (built on the Streaming Platform, level 0)
● Flink jobs — API: Java; Users: Engineers; Use cases: data pipelines, ETA, mapping; % Jobs: 45%
● Dryft — API: Streaming SQL; Users: Research scientists, data engineers; Use cases: fraud detection, coupons; % Jobs: 53%
● Flink + Python ⇒ Beam — API: Python; Users: ML engineers; Use cases: dynamic pricing [1]; % Jobs: 2%
[1] Streaming your Lyft ride prices
Lifecycle of a Flink job
Creating a new Flink Job
● Boilerplate generator to kickstart a new Flink job
● https://yeoman.io/ → [yo streamingservice helloworld]
● A representative example to make development easy
DataStream<KinesisRecord> recordStream =
    env.addSource(sourceFunc).name("source").uid("source");

DataStream<UserAndCount> locationStream =
    recordStream
        .flatMap(new LocationEventExtractor())   // extract location events from raw records
        .filter(new LocationFilter())
        .keyBy("userId")
        .timeWindow(Time.seconds(1))             // 1-second tumbling windows per user
        .aggregate(new LocationsCounter())       // count locations per user per window
        .uid("locations-counter");

locationStream.addSink(sinkFunc).name("sink").uid("sink");
Creating a new Flink Job
[Diagram: an in-house Flink library wraps the application and provides application metrics, logging, and SerDe utilities]
Deployment and tooling
[Diagram: command line tooling interacts with the Job Manager and Task Managers, with S3 for storage]
Learnings
+ Boilerplate generators 👍
- Constant maintenance and updates
+ Automating observability 💯
- User education is a challenge
+ Command line tooling to make interaction with jobs easy
- Manual and hence error-prone
Production stories
Image Credit: K.C. Green
Integrations
Balancing shard assignments
● Problem: re-sharding of Kinesis streams produces non-sequential shard IDs
● Custom shard assigner that can be extended (sketch below)
● Round-robin shard assignment that stays balanced across re-sharding of a stream
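A minimal sketch of such an assigner, assuming the KinesisShardAssigner interface and StreamShardHandle type that FLINK-8516 added to the Kinesis connector; the class name and suffix-parsing logic are illustrative, not Lyft's implementation:

import org.apache.flink.streaming.connectors.kinesis.KinesisShardAssigner;
import org.apache.flink.streaming.connectors.kinesis.model.StreamShardHandle;

public class RoundRobinShardAssigner implements KinesisShardAssigner {
    @Override
    public int assign(StreamShardHandle shard, int numParallelSubtasks) {
        // Kinesis shard IDs look like "shardId-000000000042". Assigning by the numeric
        // suffix spreads shards round-robin across subtasks and keeps the assignment
        // balanced even when re-sharding produces non-sequential shard IDs.
        String shardId = shard.getShard().getShardId();
        long suffix = Long.parseLong(shardId.substring(shardId.lastIndexOf('-') + 1));
        return (int) (suffix % numParallelSubtasks);
    }
}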
Faster and adaptive reads
● Before: each read fetches a constant number of records, so the bytes read per call vary widely
● After: a variable number of records per read, capped at a constant (max) number of bytes (wiring sketch below)
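A sketch of wiring both pieces into a job's setup code (reusing env from the earlier snippet); the stream name and region are placeholders, and the SHARD_USE_ADAPTIVE_READS property introduced around FLINK-9692 is an assumption — check the constant name in your connector version:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

Properties consumerConfig = new Properties();
consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
// Switch getRecords from a fixed record count to a byte-capped, adaptive batch size.
consumerConfig.setProperty(ConsumerConfigConstants.SHARD_USE_ADAPTIVE_READS, "true");

FlinkKinesisConsumer<String> consumer =
    new FlinkKinesisConsumer<>("event-stream", new SimpleStringSchema(), consumerConfig);
// Plug in the custom shard assigner from the previous sketch.
consumer.setShardAssigner(new RoundRobinShardAssigner());

DataStream<String> records = env.addSource(consumer).name("kinesis-source").uid("kinesis-source");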
Event time skew in source partitions
[Diagrams: shards advance to very different points on the event time scale (t=0 … t=100); the shard consumer buffers events from fast shards, so state grows while slower shards catch up (t=1 … t=15)]
Source Synchronization
[Diagram: per-shard threads are coordinated by the source thread through a global watermark]
With synchronization
[Chart: event time progress across shards after synchronization]
Flink contributions
● Shard balancing
○ Per-shard watermarking [FLINK-5697]
○ Custom Shard assignment [FLINK-8516]
○ ListShards API [FLINK-8944]
● Adaptive reads
○ Adding adaptive reads [FLINK-9692]
● Source synchronization
○ Global Aggregate Manager in JobMaster [FLINK-10887]
○ Source synchronization in the Kinesis connector [FLINK-10921]
○ (Long term) Event time alignment as a part of FLIP-27
Checkpointing
● Transient errors in reading and writing checkpoints to S3
● Frequent restarts
[Diagram: a filesystem wrapper and connector extensions sit between the application and the Flink library]
Checkpointing
● Hot partitions: s3://<bucket>/application/checkpoints/
● [FLINK-9061] Entropy injection
● Entropy-injected path: s3://<bucket>/<HASH>/application/checkpoints/ (example configuration below)
● Externalized checkpoints ✅
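An example flink-conf.yaml fragment for entropy injection; the bucket and entropy length are placeholders, while s3.entropy.key and s3.entropy.length are the options added by FLINK-9061:

state.checkpoints.dir: s3://<bucket>/_entropy_/application/checkpoints/
# The _entropy_ marker in checkpoint data paths is replaced with 4 random characters,
# spreading writes across S3 prefixes and avoiding hot partitions.
s3.entropy.key: _entropy_
s3.entropy.length: 4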
Challenges
● A lot of data pipeline use cases
○ Log ingestion to Kibana
○ Change data capture logs
○ Analytics event pipeline
● Common requirement
○ Data freshness
○ Data completeness (eventual)
● During an outage
○ Recover job and resume real-time data processing
○ Backfill mechanism
● Checkpointing under backpressure
○ Failing checkpoints
○ Limited forward progress
○ [FLIP-76]
○ Debugging bottleneck is hard
● Kafka consumer improvements
○ Idle partition detection
○ Source synchronization
● Recovery and HA
○ Partial recovery from task failures [FLIP-1]
○ Zookeeper and HA [FLINK-10030]
Learnings
● Bandwidth control is important
● Wrappers make users ☺
● Investing in a performance benchmark for every new connector is 💯
● Checkpointing under backpressure 😨
● Open source ⇒ 🎉🎉🎉
What’s next?
Kubernetes
Image Credit: Dilbert
Flink on k8s: Motivation
● Deploying new jobs required provisioning AWS resources
● 10+ minute replacement time for node failures
● Manual, error-prone process to update jobs and roll back on failures
● Need for immutable infrastructure
Flink on k8s: Overview
● Core component: Flinkk8soperator[1]
● Operator manages Flink applications
[Diagram: a job definition is submitted to the Flink operator, which manages the JobManager and TaskManagers]
[1] Managing Flink on Kubernetes
Flinkk8soperator: Design goal
Any operation that produces downtime should either
succeed, or roll back as soon as the error is detected
Flink Application CRD
● Flink Application custom resource definition
● Describes the desired state for the Flink application
● The Flink operator is responsible for evolving the application to that state
apiVersion: flink.k8s.io/v1beta1
kind: FlinkApplication
metadata:
  name: wordcount
spec:
  image: lyft/wordcount:latest
  jarName: "wordcount-1.0.0-SNAPSHOT.jar"
  parallelism: 30
  entryClass: "com.lyft.WordCount"
  flinkVersion: "1.8"
  flinkConfig:
    state.backend: filesystem
  jobManagerConfig:
    resources:
      requests:
        memory: "8Gi"
        cpu: "4"
  taskManagerConfig:
    resources:
      requests:
        memory: "8Gi"
        cpu: "8"
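For illustration, assuming the operator and its CRD are installed in the cluster, the manifest above is managed like any other Kubernetes resource (the file name and resource plural here are assumptions):

kubectl apply -f wordcount.yaml
kubectl get flinkapplications wordcount   # the operator reports the application's current phase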
State Machine
[Diagram: operator states include New, Create Cluster, Submit Job, Savepointing, Updating, Rolling back, Running, and Failed]
Flink on k8s: Wins
● Improved stability and recoverability
● Less downtime due to common failures
● Easier configuration and tooling
● More flexible deployment strategies
Open Source
We open-sourced the Flink Operator in May. It is already in production use @ Lyft.
(https://github.com/lyft/flinkk8soperator)
● 5 external contributors
● 20 external contributions
● 33 issues opened
We are hiring! lyft.com/careers
Q & A
