Running Flink in Production: The Good, The Bad and The In Between
Oct 8, 2019
Lakshmi Rao | @glaksh100
What’s Lyft?
Dynamic Pricing: supply/demand curve, pricing
Notifications
ETA Prediction
Coupons
User Experience: top destinations, core experience
Fraud: behaviour fingerprinting, monetary impact, imperative to act fast
Streaming Platform @ Lyft
Data Platform @ Lyft
[Diagram: events from backend services and the mobile app flow through streaming sources and out to streaming sinks]
Users @ Lyft
Use cases:
❖ Data pipelines
❖ Marketplace: speed observations, locations data
❖ Building ML models
❖ Pricing platform
Users:
❖ Data engineers
❖ Research scientists
❖ Product managers
❖ Business operations
Teams: Analytics, ML Engineers, Engineers
Abstraction levels (built on the Streaming Platform, level 0)
● Flink jobs — API: Java; Users: Engineers; Use cases: data pipelines, ETA, mapping; % Jobs: 45%
● Dryft — API: Streaming SQL; Users: Research scientists, data engineers; Use cases: fraud detection, coupons; % Jobs: 53%
● Flink + Python ⇒ Beam — API: Python; Users: ML engineers; Use cases: dynamic pricing [1]; % Jobs: 2%
[1] Streaming your Lyft ride prices
Lifecycle of a Flink job
Creating a new Flink Job
● Boilerplate generator to kickstart a new Flink job
● https://yeoman.io/ → [yo streamingservice helloworld]
● A representative example to make development easy
DataStream<KinesisRecord> recordStream =
    env.addSource(sourceFunc).name("source").uid("source");

DataStream<UserAndCount> locationStream =
    recordStream
        .flatMap(new LocationEventExtractor())   // extract location events from raw records
        .filter(new LocationFilter())
        .keyBy("userId")
        .timeWindow(Time.seconds(1))             // 1-second tumbling windows per user
        .aggregate(new LocationsCounter())       // count locations per user per window
        .uid("locations-counter");

locationStream.addSink(sinkFunc).name("sink").uid("sink");
Creating a new Flink Job
[Diagram: an in-house Flink library wraps the application and provides application metrics, logging, and SerDe utilities]
Deployment and tooling
[Diagram: command line tooling interacts with the Job Manager and Task Managers, with S3 for storage]
Learnings
+ Boilerplate generators 👍
- Constant maintenance and updates
+ Automating observability 💯
- User education is a challenge
+ Command line tooling to make interaction with jobs easy
- Manual and hence error-prone
Production stories
Image Credit: K.C. Green
Integrations
Balancing shard assignments
● Problem: re-sharding of Kinesis streams produces non-sequential shard IDs
● Custom shard assigner that can be extended (sketch below)
● Round-robin shard assignment that stays balanced across re-sharding of a stream
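A minimal sketch of such an assigner, assuming the KinesisShardAssigner interface and StreamShardHandle type that FLINK-8516 added to the Kinesis connector; the class name and suffix-parsing logic are illustrative, not Lyft's implementation:

import org.apache.flink.streaming.connectors.kinesis.KinesisShardAssigner;
import org.apache.flink.streaming.connectors.kinesis.model.StreamShardHandle;

public class RoundRobinShardAssigner implements KinesisShardAssigner {
    @Override
    public int assign(StreamShardHandle shard, int numParallelSubtasks) {
        // Kinesis shard IDs look like "shardId-000000000042". Assigning by the numeric
        // suffix spreads shards round-robin across subtasks and keeps the assignment
        // balanced even when re-sharding produces non-sequential shard IDs.
        String shardId = shard.getShard().getShardId();
        long suffix = Long.parseLong(shardId.substring(shardId.lastIndexOf('-') + 1));
        return (int) (suffix % numParallelSubtasks);
    }
}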
Faster and adaptive reads
● Before: each read fetches a constant number of records, so the bytes read per call vary widely
● After: a variable number of records per read, capped at a constant (max) number of bytes (wiring sketch below)
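A sketch of wiring both pieces into a job's setup code (reusing env from the earlier snippet); the stream name and region are placeholders, and the SHARD_USE_ADAPTIVE_READS property introduced around FLINK-9692 is an assumption — check the constant name in your connector version:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

Properties consumerConfig = new Properties();
consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
// Switch getRecords from a fixed record count to a byte-capped, adaptive batch size.
consumerConfig.setProperty(ConsumerConfigConstants.SHARD_USE_ADAPTIVE_READS, "true");

FlinkKinesisConsumer<String> consumer =
    new FlinkKinesisConsumer<>("event-stream", new SimpleStringSchema(), consumerConfig);
// Plug in the custom shard assigner from the previous sketch.
consumer.setShardAssigner(new RoundRobinShardAssigner());

DataStream<String> records = env.addSource(consumer).name("kinesis-source").uid("kinesis-source");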
Event time skew in source partitions
[Diagrams: shards advance to very different points on the event time scale (t=0 … t=100); the shard consumer buffers events from fast shards, so state grows while slower shards catch up (t=1 … t=15)]
Source Synchronization
[Diagram: per-shard threads are coordinated by the source thread through a global watermark]
With synchronization
[Chart: event time progress across shards after synchronization]
Flink contributions
● Shard balancing
○ Per-shard watermarking [FLINK-5697]
○ Custom Shard assignment [FLINK-8516]
○ ListShards API [FLINK-8944]
● Adaptive reads
○ Adding adaptive reads [FLINK-9692]
● Source synchronization
○ Global Aggregate Manager in JobMaster [FLINK-10887]
○ Source synchronization in the Kinesis connector [FLINK-10921]
○ (Long term) Event time alignment as a part of FLIP-27
Checkpointing
● Transient errors in reading and writing checkpoints to S3
● Frequent restarts
[Diagram: a filesystem wrapper and connector extensions sit between the application and the Flink library]
Checkpointing
● Hot partitions: s3://<bucket>/application/checkpoints/
● [FLINK-9061] Entropy injection
● Entropy-injected path: s3://<bucket>/<HASH>/application/checkpoints/ (example configuration below)
● Externalized checkpoints ✅
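An example flink-conf.yaml fragment for entropy injection; the bucket and entropy length are placeholders, while s3.entropy.key and s3.entropy.length are the options added by FLINK-9061:

state.checkpoints.dir: s3://<bucket>/_entropy_/application/checkpoints/
# The _entropy_ marker in checkpoint data paths is replaced with 4 random characters,
# spreading writes across S3 prefixes and avoiding hot partitions.
s3.entropy.key: _entropy_
s3.entropy.length: 4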
Challenges
● A lot of data pipeline use cases
○ Log ingestion to Kibana
○ Change data capture logs
○ Analytics event pipeline
● Common requirement
○ Data freshness
○ Data completeness (eventual)
● During an outage
○ Recover job and resume real-time data processing
○ Backfill mechanism
● Checkpointing under backpressure
○ Failing checkpoints
○ Limited forward progress
○ [FLIP-76]
○ Debugging bottleneck is hard
● Kafka consumer improvements
○ Idle partition detection
○ Source synchronization
● Recovery and HA
○ Partial recovery from task failures [FLIP-1]
○ Zookeeper and HA [FLINK-10030]
Learnings
● Bandwidth control is important
● Wrappers make users ☺
● Investing in a performance benchmark for every new connector is 💯
● Checkpointing under backpressure 😨
● Open source ⇒ 🎉🎉🎉
What’s next?
Kubernetes
Image Credit: Dilbert
Flink on k8s: Motivation
● Deploying new jobs required provisioning AWS resources
● 10+ minute replacement time for node failures
● Manual, error-prone process to update jobs and roll back on failures
● Need for immutable infrastructure
Flink on k8s: Overview
● Core component: Flinkk8soperator[1]
● Operator manages Flink applications
[Diagram: a job definition is submitted to the Flink operator, which manages the JobManager and TaskManagers]
[1] Managing Flink on Kubernetes
Flinkk8soperator: Design goal
Any operation that produces downtime should either
succeed, or roll back as soon as the error is detected
Flink Application CRD
● Flink Application custom resource definition
● Describes the desired state for the Flink application
● The Flink operator is responsible for evolving the application to that state
apiVersion: flink.k8s.io/v1beta1
kind: FlinkApplication
metadata:
  name: wordcount
spec:
  image: lyft/wordcount:latest
  jarName: "wordcount-1.0.0-SNAPSHOT.jar"
  parallelism: 30
  entryClass: "com.lyft.WordCount"
  flinkVersion: "1.8"
  flinkConfig:
    state.backend: filesystem
  jobManagerConfig:
    resources:
      requests:
        memory: "8Gi"
        cpu: "4"
  taskManagerConfig:
    resources:
      requests:
        memory: "8Gi"
        cpu: "8"
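For illustration, assuming the operator and its CRD are installed in the cluster, the manifest above is managed like any other Kubernetes resource (the file name and resource plural here are assumptions):

kubectl apply -f wordcount.yaml
kubectl get flinkapplications wordcount   # the operator reports the application's current phase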
State Machine
[Diagram: operator states include New, Create Cluster, Submit Job, Savepointing, Updating, Rolling back, Running, and Failed]
Flink on k8s: Wins
● Improved stability and recoverability
● Less downtime due to common failures
● Easier configuration and tooling
● More flexible deployment strategies
Open Source
We open-sourced the Flink Operator in May. It is already in production use @ Lyft.
(https://github.com/lyft/flinkk8soperator)
● 5 external contributors
● 20 external contributions
● 33 issues opened
We are hiring! lyft.com/careers
Q & A
