3 Flink Mistakes We Made So You Won't Have To

Robert Metzger, Staff Engineer @ Decodable
Apache Flink Committer and PMC Chair
Sharon Xie, Founding Engineer @ Decodable

What we’ll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka
#2 Inefficient Memory Configuration
#3 Inefficient Checkpointing Config

#1 Data Loss with Flink Exactly-Once Delivery to Kafka

Two Phase Commit for EO - Happy Path

Two Phase Commit for EO - Phase 1 Failure

Two Phase Commit for EO - Phase 2 Failure

Life is doomed when…
Phase 2 can’t be successful 💣🔥

Important Kafka Broker Configurations
transaction.max.timeout.ms
● Default: 900000 (15 minutes)
transactional.id.expiration.ms
● Default: 604800000 (7 days)

Excessive Memory Usage
● Flink Kafka Producer creates a new transaction id for each checkpoint per task
● transactional.id.expiration.ms = 604800000 (7 days)
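A back-of-the-envelope sketch of why this adds up. The model is deliberately simplified (one id per checkpoint per task, remembered until it expires), and the 10s checkpoint interval and parallelism of 100 are assumed figures for illustration, not measurements from the talk.

```python
SEVEN_DAYS_MS = 604_800_000   # default transactional.id.expiration.ms
ONE_HOUR_MS = 3_600_000       # a much shorter expiration, for comparison


def retained_transactional_ids(expiration_ms: int,
                               checkpoint_interval_ms: int,
                               parallelism: int) -> int:
    """Rough upper bound on the ids the broker still has to remember."""
    checkpoints_in_window = expiration_ms // checkpoint_interval_ms
    return checkpoints_in_window * parallelism


# Assumed job: checkpoint every 10s, 100 parallel sink tasks.
default_ids = retained_transactional_ids(SEVEN_DAYS_MS, 10_000, 100)
shorter_ids = retained_transactional_ids(ONE_HOUR_MS, 10_000, 100)

print(f"7 days : {default_ids:,} ids")
print(f"1 hour : {shorter_ids:,} ids")
```

Under these assumptions the 7-day default keeps 168 times as many ids around as a 1-hour expiration would.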

Better Kafka Transaction Configuration
● transaction.max.timeout.ms = 604800000 (7 days)
○ From default: 15min
● transactional.id.expiration.ms = 3600000 (1 hour)
○ From default: 7 days
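The effect of raising transaction.max.timeout.ms can be sketched as two checks. This is a simplified model, not Kafka's actual logic: it assumes the broker rejects a producer whose transaction.timeout.ms exceeds the broker maximum, and that phase 2 can only succeed while the open transaction has not timed out. The function names and the 2-hour outage are hypothetical.

```python
FIFTEEN_MINUTES_MS = 900_000   # broker default for transaction.max.timeout.ms
SEVEN_DAYS_MS = 604_800_000    # raised value from the slide


def producer_timeout_accepted(transaction_timeout_ms: int,
                              broker_max_timeout_ms: int) -> bool:
    """A producer timeout above the broker maximum is refused."""
    return transaction_timeout_ms <= broker_max_timeout_ms


def commit_can_still_succeed(outage_ms: int, transaction_timeout_ms: int) -> bool:
    """Simplified: phase 2 works only if the transaction has not timed out yet."""
    return outage_ms < transaction_timeout_ms


# Hypothetical incident: the job is down for 2 hours between phase 1 and phase 2.
two_hour_outage_ms = 2 * 60 * 60 * 1000

with_default = commit_can_still_succeed(two_hour_outage_ms, FIFTEEN_MINUTES_MS)
with_raised = commit_can_still_succeed(two_hour_outage_ms, SEVEN_DAYS_MS)

print("15 min timeout survives outage:", with_default)
print("7 day timeout survives outage: ", with_raised)
print("7 day producer timeout accepted by default broker:",
      producer_timeout_accepted(SEVEN_DAYS_MS, FIFTEEN_MINUTES_MS))
```

The last check is why the broker setting has to change first: a long producer timeout is only usable once the broker maximum allows it.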

InvalidPidMappingException
Restoring from a checkpoint/savepoint that is more than 1 hour old (the new transactional.id.expiration.ms) fails with:
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.

Fix InvalidPidMappingException
Short-term: Ignore InvalidPidMappingException 😇
● ONLY when transaction.timeout.ms (Kafka client configuration in Flink) > transactional.id.expiration.ms
Long-term: 🤝
● KIP-939: Support Participation in 2PC
● FLIP-319: Integrate with Kafka's Support for Proper 2PC Participation
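The short-term rule can be written down as a guard that a deployment check might run before enabling the workaround. This only encodes the condition stated on the slide; the function name is hypothetical and nothing here calls into Flink or Kafka.

```python
def safe_to_ignore_invalid_pid_mapping(transaction_timeout_ms: int,
                                       transactional_id_expiration_ms: int) -> bool:
    """Slide rule: ignore the exception ONLY when the client-side
    transaction.timeout.ms is larger than the broker-side
    transactional.id.expiration.ms."""
    return transaction_timeout_ms > transactional_id_expiration_ms


# Tuned setup from the talk: 7-day timeout, 1-hour id expiration.
tuned = safe_to_ignore_invalid_pid_mapping(604_800_000, 3_600_000)

# Broker defaults (15-minute max timeout, 7-day id expiration): do not ignore.
defaults = safe_to_ignore_invalid_pid_mapping(900_000, 604_800_000)

print("tuned setup may ignore the exception:   ", tuned)
print("default setup may ignore the exception: ", defaults)
```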

#1 Data Loss with Flink Exactly-Once Delivery to Kafka ✅

How to Tune TaskManager Memory
● Flink automatically computes memory budgets. Just provide total process size.
● Main memory consumers
○ Framework + Task heap
○ RocksDB State backend (off-heap)
○ Network stack (off-heap)
○ JVM internal structures [metaspace, thread stacks] (off-heap)

● Example: taskmanager.memory.process.size: 8gb
[Diagram: the 8gb process split into Framework + Task heap, RocksDB State backend (off-heap), Network stack (off-heap), and JVM internal structures [metaspace, thread stacks] (off-heap)]

● Let’s tune for this particular job
[Diagram: unused portions of the memory budget] 150mb + 700mb + 2300mb = 3150mb unused memory

● Give as much memory as possible to Managed Memory = RocksDB
taskmanager.memory.task.heap.size: 1 gb
taskmanager.memory.managed.size: 5800 mb
taskmanager.memory.network.min: 32 mb
taskmanager.memory.network.max: 32 mb
taskmanager.memory.jvm-metaspace.size: 120 mb
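A quick sanity check that these values fit into the 8gb process size from the example. The framework heap (128mb), framework off-heap (128mb), task off-heap (0) and the JVM overhead range (192mb to 1gb) are assumed to be Flink's defaults and are not stated in the talk; verify them against the documentation for your Flink version.

```python
PROCESS_MB = 8 * 1024  # taskmanager.memory.process.size: 8gb

# Values set explicitly on the slide.
explicit_mb = {
    "task heap": 1024,
    "managed (RocksDB)": 5800,
    "network": 32,
    "jvm metaspace": 120,
}

# ASSUMED Flink defaults for components the slide leaves untouched.
assumed_default_mb = {
    "framework heap": 128,
    "framework off-heap": 128,
    "task off-heap": 0,
}
ASSUMED_JVM_OVERHEAD_RANGE_MB = (192, 1024)

# Whatever is left of the process size ends up as JVM overhead.
jvm_overhead_mb = (PROCESS_MB
                   - sum(explicit_mb.values())
                   - sum(assumed_default_mb.values()))
managed_share = explicit_mb["managed (RocksDB)"] / PROCESS_MB

low, high = ASSUMED_JVM_OVERHEAD_RANGE_MB
print(f"left for JVM overhead: {jvm_overhead_mb} mb "
      f"(fits assumed range: {low <= jvm_overhead_mb <= high})")
print(f"managed memory share of the process: {managed_share:.0%}")
```

Under these assumptions roughly 71% of the process goes to managed memory, and the remainder still lands inside the allowed JVM overhead range.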

Memory Configuration Wrap Up
● Stateful workloads with RocksDB benefit most from as much memory as possible
→ Check out the full documentation: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/

#1 Data Loss with Flink Exactly-Once Delivery to Kafka ✅
#2 Inefficient Memory Configuration ✅

#3 Reliable, Fast Checkpointing
execution.checkpointing.interval: 10s
execution.checkpointing.min-pause: 10s
Make sure your job is not spending all the time checkpointing
Image source: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#tuning-checkpointing
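How the two settings interact can be sketched with a simplified scheduling model: the next checkpoint starts no earlier than one interval after the previous start and no earlier than min-pause after the previous one finished. It assumes one checkpoint at a time; the checkpoint durations below are made up.

```python
def next_checkpoint_start(prev_start_s: float, prev_duration_s: float,
                          interval_s: float = 10.0,
                          min_pause_s: float = 10.0) -> float:
    """Earliest start of the next checkpoint under the simplified model."""
    prev_end_s = prev_start_s + prev_duration_s
    return max(prev_start_s + interval_s, prev_end_s + min_pause_s)


def checkpointing_fraction(duration_s: float, interval_s: float,
                           min_pause_s: float) -> float:
    """Share of wall-clock time spent checkpointing in steady state."""
    cycle_s = next_checkpoint_start(0.0, duration_s, interval_s, min_pause_s)
    return duration_s / cycle_s


# A 4s checkpoint with interval=10s, min-pause=10s: next one starts at t=14s.
print(next_checkpoint_start(0.0, 4.0))

# A slow 30s checkpoint with no min-pause: the job checkpoints back to back.
print(checkpointing_fraction(30.0, interval_s=10.0, min_pause_s=0.0))

# The same slow checkpoint with min-pause=10s leaves some room for processing.
print(checkpointing_fraction(30.0, interval_s=10.0, min_pause_s=10.0))
```

The min-pause is what guarantees breathing room when a checkpoint takes longer than the interval.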

Reliable, Fast Checkpointing
state.backend: rocksdb
state.backend.incremental: true
Only upload the diff to the last checkpoint
[Diagram: checkpoint #33 full, #34 incremental, #35 incremental]
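The idea of "only upload the diff" can be illustrated with a toy model. It relies on RocksDB's SST files being immutable, so a file that was already uploaded never needs to be sent again. The file names and sizes are invented, and this is not Flink's actual implementation.

```python
def files_to_upload(current_files: dict, already_uploaded: set) -> dict:
    """Return only the SST files (name -> size in MB) not uploaded before."""
    return {name: size for name, size in current_files.items()
            if name not in already_uploaded}


# Hypothetical SST files present at each checkpoint (name -> size in MB).
checkpoint_33 = {"000001.sst": 64, "000002.sst": 64, "000003.sst": 64}
checkpoint_34 = {**checkpoint_33, "000004.sst": 8}
# Before #35 a compaction merged files 2-4 into 000005.sst.
checkpoint_35 = {"000001.sst": 64, "000005.sst": 100, "000006.sst": 8}

uploaded = set()
upload_sizes = []
for files in (checkpoint_33, checkpoint_34, checkpoint_35):
    new_files = files_to_upload(files, uploaded)
    upload_sizes.append(sum(new_files.values()))
    uploaded |= set(new_files)

print(upload_sizes)  # MB uploaded for #33 (full), #34 and #35 (incremental)
```

In this toy run only the first checkpoint uploads everything; later ones upload just the files that are new since the last upload, including files produced by compaction.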

Reliable, Fast Checkpointing
state.backend.local-recovery: true
Local recovery: Only re-download the state on failed machines
After a failure without local recovery: All TaskManagers download the state
[Diagram: TM1 TM2 TM3 TM4; 1 - TM4 fails; 2 - Recovery]
With local recovery: Most machines use local disks, only one needs to download
[Diagram: TM1 TM2 TM3 TM4; 1 - TM4 fails; 2 - Recovery]
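The difference in recovery traffic can be put into numbers with a small model of the two diagrams. The 50 GB of state per TaskManager is an assumed figure, and the model ignores rescaling and partial local copies.

```python
def gb_to_download(state_gb_per_tm: dict, failed_tms: set,
                   local_recovery: bool) -> int:
    """GB fetched from remote checkpoint storage after a failure."""
    if local_recovery:
        # Surviving TaskManagers restore from their local copy.
        return sum(size for tm, size in state_gb_per_tm.items()
                   if tm in failed_tms)
    # Without local recovery every TaskManager downloads its state again.
    return sum(state_gb_per_tm.values())


# Hypothetical cluster: four TaskManagers with 50 GB of state each; TM4 fails.
state = {"TM1": 50, "TM2": 50, "TM3": 50, "TM4": 50}

without_local = gb_to_download(state, {"TM4"}, local_recovery=False)
with_local = gb_to_download(state, {"TM4"}, local_recovery=True)

print(f"without local recovery: {without_local} GB")
print(f"with local recovery:    {with_local} GB")
```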

Fast Checkpointing and State
Put your RocksDB state on the fastest available disk. Typically a local SSD.
[Diagram: a Flink worker whose TaskManager uses a remote EBS volume, next to a Flink worker whose TaskManager uses a local SSD]

The End – Q&A
Get your free decodable.co account today if you want us to
handle the issues discussed in the talk.
Visit the Decodable Booth (201) for any Flink related questions.

Fast Checkpointing and State
● RocksDB stores your state in the /tmp directory
● On AWS Kubernetes, that’s by default an EBS volume
Type                       Size     IOPS (max)                   Throughput   Price per Month
io1                        950 GB   64000                        -            $4278
io2 block express          950 GB   256000                       -            $9769
gp3                        950 GB   16000                        1000 MB/s    $176
m6gd.4xlarge (64g | 16c)   950 GB   Read: 93000, Write: 222000   -            +$78 per instance for a local NVMe SSD
→ Using an instance type with a local SSD gives you by far the best performance per $
We just mount the entire Docker working directory on the local SSD.
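The "best performance per $" claim can be checked directly from the table. The sketch below computes the monthly price per 1000 IOPS; for the local NVMe SSD it uses the read IOPS figure, which is the less favorable of the two numbers listed. The labels are shorthand introduced here.

```python
# (max IOPS, price per month in USD), taken from the table above.
volumes = {
    "io1": (64_000, 4278),
    "io2 block express": (256_000, 9769),
    "gp3": (16_000, 176),
    "local NVMe (m6gd.4xlarge)": (93_000, 78),  # read IOPS, +$78 per instance
}

usd_per_1000_iops = {
    name: price / (iops / 1000)
    for name, (iops, price) in volumes.items()
}

for name, cost in sorted(usd_per_1000_iops.items(), key=lambda item: item[1]):
    print(f"{name:<28} ${cost:7.2f} per 1000 IOPS per month")

cheapest = min(usd_per_1000_iops, key=usd_per_1000_iops.get)
```

Even against gp3, the cheapest EBS option in the table, the local SSD comes out more than ten times cheaper per IOPS.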

Recap
● Flink EO with Kafka can still cause data loss
● Transaction timeout is the key
● Flink EO implementation can consume excessive memory from Kafka
● A better approach with Flink + Kafka is under way
