"Is your team looking to bring the power of full, end-to-end stream processing with Apache Flink to your organization, but concerned about the time, resources, or skills required? In this talk, Sharon Xie, Decodable Founding Engineer, and Robert Metzger, Apache Flink PMC Member, will reveal the biggest lessons learned and how to avoid common mistakes when adopting Apache Flink. If you plan to implement Apache Flink, this is a session you do not want to miss.
We will talk about avoiding data loss with Flink’s Kafka exactly-once producer, getting the most bang for the buck out of Flink’s memory configuration, and tuning for efficient checkpointing."
1. 3 Flink Mistakes We Made So You Won’t Have To
Robert Metzger, Staff Engineer @ Decodable
Apache Flink Committer and PMC Chair
Sharon Xie, Founding Engineer @ Decodable
2. What we’ll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka
#2 Inefficient Memory Configuration
#3 Inefficient Checkpointing Config
3. #1 Data Loss with Flink Exactly-Once Delivery to Kafka
10. Excessive Memory Usage
● The Flink Kafka producer creates a new transactional id for each checkpoint, per task
● transactional.id.expiration.ms = 604800000 (7 days, the default), so Kafka keeps every one of these ids for 7 days
11. Better Kafka Transaction Configuration
● transaction.max.timeout.ms = 604800000 (7 days)
○ Up from the default of 15 min
● transactional.id.expiration.ms = 3600000 (1 hour)
○ Down from the default of 7 days
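A rough back-of-the-envelope sketch of why the shorter expiration helps. It assumes the simplified model from the slides (one new transactional id per checkpoint per task, each remembered until transactional.id.expiration.ms has passed). The parallelism of 10 and the 10-second checkpoint interval are made-up example values, not numbers from the talk.

```python
def retained_transactional_ids(parallelism: int,
                               checkpoint_interval_ms: int,
                               expiration_ms: int) -> int:
    """Estimate how many transactional ids Kafka still remembers.

    Simplified model: every task creates one id per checkpoint, and each id
    is kept for expiration_ms after it was last used.
    """
    checkpoints_in_window = expiration_ms // checkpoint_interval_ms
    return parallelism * checkpoints_in_window


# Hypothetical job: 10 sink tasks, one checkpoint every 10 seconds.
PARALLELISM = 10
CHECKPOINT_INTERVAL_MS = 10_000

seven_days = retained_transactional_ids(PARALLELISM, CHECKPOINT_INTERVAL_MS, 604_800_000)
one_hour = retained_transactional_ids(PARALLELISM, CHECKPOINT_INTERVAL_MS, 3_600_000)

print(f"expiration 7 days: ~{seven_days} ids retained")
print(f"expiration 1 hour: ~{one_hour} ids retained")
```

Under these assumptions the 1-hour setting keeps 168 times fewer ids around than the 7-day default.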
12. InvalidPidMappingException
When the checkpoint/savepoint to restore from is more than 1 hour old (the new transactional.id.expiration.ms), the job throws:
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
13. Fix InvalidPidMappingException
Short-term: Ignore InvalidPidMappingException 😇
● ONLY when transaction.timeout.ms (Kafka client configuration in Flink) > transactional.id.expiration.ms
Long-term: 🤝
● KIP-939: Support Participation in 2PC
● FLIP-319: Integrate with Kafka's Support for Proper 2PC Participation
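The short-term condition can be written down as a small configuration check. This is only a sketch: the function name is hypothetical, and the client-side transaction.timeout.ms of 7 days is an assumed example value. The extra check that the client timeout must not exceed the broker's transaction.max.timeout.ms reflects standard Kafka behaviour.

```python
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000   # 604800000
ONE_HOUR_MS = 60 * 60 * 1000              # 3600000
FIFTEEN_MIN_MS = 15 * 60 * 1000           # 900000


def can_ignore_invalid_pid_mapping(transaction_timeout_ms: int,
                                   transactional_id_expiration_ms: int,
                                   transaction_max_timeout_ms: int) -> bool:
    """Return True if the precondition for ignoring the exception holds."""
    if transaction_timeout_ms > transaction_max_timeout_ms:
        # Brokers reject producers whose timeout exceeds the broker maximum.
        raise ValueError("transaction.timeout.ms exceeds transaction.max.timeout.ms")
    return transaction_timeout_ms > transactional_id_expiration_ms


# Tuned setup: broker values from the slides, client timeout assumed to be 7 days.
tuned = can_ignore_invalid_pid_mapping(
    transaction_timeout_ms=SEVEN_DAYS_MS,
    transactional_id_expiration_ms=ONE_HOUR_MS,
    transaction_max_timeout_ms=SEVEN_DAYS_MS,
)

# Broker defaults, with the client timeout at the largest value they allow.
untuned = can_ignore_invalid_pid_mapping(
    transaction_timeout_ms=FIFTEEN_MIN_MS,
    transactional_id_expiration_ms=SEVEN_DAYS_MS,
    transaction_max_timeout_ms=FIFTEEN_MIN_MS,
)

print(f"tuned: {tuned}, untuned: {untuned}")
```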
14. What we’ll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka ✅
#2 Inefficient Memory Configuration
#3 Inefficient Checkpointing Config
16. How to Tune TaskManager Memory
● Flink automatically computes memory budgets
Just provide total process size.
● Main memory consumers
○ Framework + Task heap
○ RocksDB State backend (off-heap)
○ Network stack (off-heap)
○ JVM internal structures [metaspace, thread stacks] (off-heap)
17. How to Tune TaskManager Memory
● Example: taskmanager.memory.process.size: 8gb
[Diagram: the 8gb budget split into JVM internal structures [metaspace, thread stacks] (off-heap), Framework + Task heap, RocksDB State backend (off-heap), and Network stack (off-heap)]
18. How to Tune TaskManager Memory
● Let’s tune for this particular job
150mb + 700mb + 2300mb = 3150mb unused memory
19. How to Tune TaskManager Memory
● Give as much memory as possible to Managed Memory = RocksDB
taskmanager.memory.task.heap.size: 1 gb
taskmanager.memory.managed.size: 5800 mb
taskmanager.memory.network.min: 32 mb
taskmanager.memory.network.max: 32 mb
taskmanager.memory.jvm-metaspace.size: 120 mb
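As a sanity check, the explicitly sized components can be added up against the total process size. The sketch assumes the 8gb taskmanager.memory.process.size from the earlier example still applies and that 1 gb is 1024 mb. Whatever is left over has to cover the components the slide does not size explicitly (such as framework memory and JVM overhead).

```python
PROCESS_SIZE_MB = 8 * 1024  # assumed: taskmanager.memory.process.size: 8gb

configured_mb = {
    "taskmanager.memory.task.heap.size": 1024,
    "taskmanager.memory.managed.size": 5800,
    "taskmanager.memory.network.max": 32,  # min == max, so the size is fixed
    "taskmanager.memory.jvm-metaspace.size": 120,
}

explicit_total_mb = sum(configured_mb.values())
remaining_mb = PROCESS_SIZE_MB - explicit_total_mb
managed_share = configured_mb["taskmanager.memory.managed.size"] / PROCESS_SIZE_MB

print(f"explicitly configured: {explicit_total_mb} mb")
print(f"left for other components: {remaining_mb} mb")
print(f"managed memory share of the process: {managed_share:.0%}")
```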
20. Memory Configuration Wrap Up
● Stateful workloads with RocksDB benefit most from as much memory as possible
→ Check out the full documentation: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/
21. What we’ll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka ✅
#2 Inefficient Memory Configuration ✅
#3 Inefficient Checkpointing Config
24. Reliable, Fast Checkpointing
state.backend.local-recovery: true
Local recovery: Only re-download the state on failed machines
After a failure without local recovery: All TaskManagers download the state
[Diagram: TM1 to TM4. Step 1: TM4 fails. Step 2: recovery, all four TaskManagers download state]
With local recovery: Most machines use local disks, only one needs to download
[Diagram: TM1 to TM4. Step 1: TM4 fails. Step 2: recovery, only one TaskManager downloads state]
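The two recovery scenarios can be captured in a toy model. This is a deliberately simplified sketch with a hypothetical function name: it only counts which TaskManagers have to fetch state from remote storage, and ignores state size, network bandwidth, and rescheduling details.

```python
def task_managers_downloading_state(total: int, failed: int, local_recovery: bool) -> int:
    """Count TaskManagers that must download state after a failure."""
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    # With local recovery, surviving TaskManagers restore from their local copy,
    # so only the failed ones need to download.
    return failed if local_recovery else total


without_local_recovery = task_managers_downloading_state(total=4, failed=1, local_recovery=False)
with_local_recovery = task_managers_downloading_state(total=4, failed=1, local_recovery=True)

print(f"without local recovery: {without_local_recovery} TaskManagers download state")
print(f"with local recovery: {with_local_recovery} TaskManager downloads state")
```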
25. Fast Checkpointing and State
Put your RocksDB state on the fastest available disk, typically a local SSD.
[Diagram: a TaskManager on your Flink worker backed by a remote EBS volume, compared with a TaskManager on your Flink worker backed by a local SSD]
26. The End – Q&A
Robert Metzger, Staff Engineer @ Decodable
Apache Flink Committer and PMC Chair
Sharon Xie, Founding Engineer @ Decodable
Get your free decodable.co account today if you want us to
handle the issues discussed in the talk.
Visit the Decodable Booth (201) for any Flink related questions.
27. Fast Checkpointing and State
● RocksDB stores your state on the /tmp directory
● On AWS Kubernetes, that’s by default an EBS volume
Type                             Size     IOPS (max)                   Throughput   Price per month
io1                              950 GB   64000                        -            $4278
io2 Block Express                950 GB   256000                       -            $9769
gp3                              950 GB   16000                        1000 MB/s    $176
m6gd.4xlarge (64 GB, 16 cores)   950 GB   Read: 93000, Write: 222000   -            +$78 per instance for a local NVMe SSD
→ Using an instance type with a local SSD gives you by far the best performance per $
We just mount the entire Docker working directory on the local SSD.
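To make "best performance per $" concrete, the prices and IOPS figures from the table above can be turned into a cost per 1000 IOPS. This is only an illustrative sketch. It uses the lower of the two IOPS figures listed for the local SSD (the read figure) as the conservative one, and it ignores throughput, durability, and the cost of the instance itself.

```python
# (monthly price in USD, max IOPS), taken from the table above.
options = {
    "io1": (4278, 64_000),
    "io2 Block Express": (9769, 256_000),
    "gp3": (176, 16_000),
    # Extra cost of the local NVMe SSD on m6gd.4xlarge; read IOPS used
    # (write IOPS are listed as 222000).
    "local NVMe SSD": (78, 93_000),
}


def usd_per_1000_iops(price_usd: float, iops: int) -> float:
    """Monthly price divided by thousands of IOPS."""
    return price_usd / (iops / 1000)


cost = {name: usd_per_1000_iops(price, iops) for name, (price, iops) in options.items()}
cheapest = min(cost, key=cost.get)

for name, value in sorted(cost.items(), key=lambda item: item[1]):
    print(f"{name}: ${value:.2f} per 1000 IOPS per month")
print(f"cheapest per IOPS: {cheapest}")
```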
28. Recap
● Flink exactly-once (EO) delivery to Kafka can still cause data loss
● The transaction timeout is the key
● Flink's EO implementation can consume excessive memory in Kafka
● A better approach for Flink + Kafka is under way (KIP-939, FLIP-319)