Exactly-Once Financial Data Processing at Scale with Flink and Pinot

Speakers
● Xiang Zhang, Stripe
● Pratyush Sharma, Stripe
● Xiaoman Dong, StarTree
Agenda
1. Problem: a near real-time, end-to-end, exactly-once processing pipeline at scale
2. The architecture: Kafka, Flink, and Pinot, and how to connect them all together
3. Operational challenges and learnings
The problem to solve—Ledger dataset
Ledger is a dataset that Stripe maintains to record all money movements.
Requirements for the Ledger pipeline
1. Near real-time processing to meet SLO targets (p99 on the order of minutes; p90 < 1 minute)
2. Able to process events at scale
3. No missing transactions: a single transaction can be worth millions of dollars
4. No duplicate transactions across the entire history
   ● Duplicates are inevitable on the source side (deployments, restarts, accidental duplicate job executions, etc.)

In other words: near real-time end-to-end exactly-once processing at scale!
The architecture: Kafka, Flink, and Pinot, and how to connect them all together
High-Level Pipeline
(architecture diagram slides showing how Kafka, Flink, and Pinot connect)
The Deduplicator
In reality, we store transaction IDs in Flink state for deduplication.
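In code, the pattern looks roughly like the minimal sketch below, where Transaction is a placeholder event type keyed by its transaction ID (an assumption for illustration, not Stripe's actual implementation):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Placeholder event type (assumption; the talk does not show the real schema).
class Transaction {
    public String id;
    public String getId() { return id; }
}

// Keyed by transaction ID; emits each transaction at most once.
public class Deduplicator
        extends KeyedProcessFunction<String, Transaction, Transaction> {

    private transient ValueState<Boolean> seen; // one flag per transaction ID

    @Override
    public void open(Configuration parameters) {
        // Backed by RocksDB in a setup like this one, so it can grow to many TB.
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void processElement(Transaction txn, Context ctx, Collector<Transaction> out)
            throws Exception {
        if (seen.value() == null) { // first occurrence of this ID: emit it
            seen.update(true);
            out.collect(txn);
        }
        // Otherwise it is a duplicate and is silently dropped.
    }
}

Wired up as events.keyBy(Transaction::getId).process(new Deduplicator()), the keyed "seen" state is checkpointed along with the rest of the job.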
Flink End-to-End Exactly-Once Processing - Flink Deduplicator
(three diagram slides walking through Flink's end-to-end exactly-once mechanism)
Source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
Pinot Exactly-Once Ingestion
● Pinot table rows are stored in immutable chunks/batches called segments.
● Real-time segments being indexed are mutable. Once they are full, they are "sealed" and become immutable, and new mutable segments are created to continue indexing.
We can consider Pinot's latest segment as one database transaction:
● The transaction begins at segment creation.
● The transaction is committed when the segment is "sealed".
● The Kafka offset is stored atomically along with the Pinot segment metadata.
● If any exception happens, the whole transaction (segment) is aborted and restarted.
● Each segment has one single Zookeeper node storing its metadata.
● Kafka offsets are stored inside the segment metadata.
● Atomicity:
  ○ The Zookeeper node update is atomic (a compare-and-set; illustrated after the example below).
  ○ The Kafka offset is updated at the same time the segment status updates to "DONE".

Example segment metadata (note that endOffset 10264 = startOffset 10240 + 24 docs, so the end offset is the next offset to consume):

{
  "segment.crc": "3251475672",
  "segment.creation.time": "1648231912328",
  "segment.download.url": "s3://some/table/mytable__8__0__20220325T1811Z",
  "segment.end.time": "1388707200000",
  "segment.flush.threshold.size": "4166",
  "segment.index.version": "v3",
  "segment.realtime.endOffset": "10264",
  "segment.realtime.numReplicas": "2",
  "segment.realtime.startOffset": "10240",
  "segment.realtime.status": "DONE",
  "segment.start.time": "1388707200000",
  "segment.time.unit": "MILLISECONDS",
  "segment.total.docs": "24"
}
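As an illustration of the atomicity claim, here is a hedged sketch of a versioned (compare-and-set) update to a single znode using Apache Curator. This is not Pinot's actual commit code; the znode path and the JSON editing step are placeholders.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.data.Stat;

public class SegmentMetadataCommitSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        String path = "/segments/mytable/mytable__8__0__20220325T1811Z"; // hypothetical path
        Stat stat = new Stat();
        byte[] metadata = zk.getData().storingStatIn(stat).forPath(path);

        // ... set status to "DONE" and record the segment's end offset ...
        byte[] updated = metadata; // placeholder for the edited JSON

        // Versioned write = compare-and-set on one znode: status and Kafka
        // offset land in a single atomic update, so a reader can never observe
        // one without the other.
        zk.setData().withVersion(stat.getVersion()).forPath(path, updated);
    }
}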
If the Pinot server is restarted or crashes:
● The whole in-progress segment is discarded.
● The segment is recreated, starting from the next offset after the previous committed segment (Segment_0 in the diagram).
● The Kafka consumer's seek() is called to rewind to that position (conceptual sketch below).
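A conceptual sketch of that recovery path with the plain Kafka consumer API (not Pinot's actual code; recall from the metadata example that segment.realtime.endOffset is the first offset the next segment should consume):

import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RecoverySketch {
    // Drop the in-progress segment's progress and rewind to the last
    // committed segment's (exclusive) end offset.
    static void resume(KafkaConsumer<byte[], byte[]> consumer,
                       TopicPartition partition,
                       long committedEndOffset) { // from segment.realtime.endOffset
        consumer.assign(List.of(partition));
        consumer.seek(partition, committedEndOffset);
        // Subsequent polls re-consume into a fresh mutable segment.
        consumer.poll(Duration.ofMillis(100));
    }
}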
Operational challenges and learnings
Caveats of exactly-once - nothing is free!
1. Exactly-once is not bulletproof: data loss or duplicates can still happen.
2. It might give users a false sense of security.
3. It is hard to add additional layers to the architecture due to the transactional guarantee.
4. Latency and SLOs are impacted by checkpoint intervals (see the sketch after this list).
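For caveat 4, the relevant knob is Flink's checkpoint interval: with a transactional sink, records only become visible downstream when the checkpoint that covers them completes. A minimal sketch; the interval values are illustrative, not Stripe's settings.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // The checkpoint interval puts a floor under end-to-end latency,
        // since sink transactions commit on checkpoint completion.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // 1 minute
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
    }
}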
Potential data loss in two-phase commit
(diagram slides; source: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html)

The transaction can expire in Kafka!
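This failure mode is why the lessons later in the talk say to size the Kafka transaction timeout against worst-case job downtime: if the job stays down longer than transaction.timeout.ms, Kafka aborts the pre-committed transaction and its records are lost. A sketch using Flink's KafkaSink builder; the broker address, topic, and timeout value are placeholders.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class SinkSketch {
    static KafkaSink<String> buildSink() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")        // placeholder
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("deduped-ledger-events") // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("ledger-dedup")
                // Must comfortably exceed the longest plausible downtime, and
                // must stay <= the broker's transaction.max.timeout.ms.
                .setProperty("transaction.timeout.ms", "3600000") // 1 h, illustrative
                .build();
    }
}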
Optimizing large state hydration at recovery time
1. The ledger deduplicator app maintains tens of terabytes of state to do all-time deduplication.
2. Task-local recovery doesn't work with multiple disks mounted (FLINK-10954):
   ● We need to rehydrate the entire state every time the job is rescheduled (job failure, host failure/restart/recycle).
   ● This impacts end-to-end latency.
3. Even if we made local recovery work, Stripe recycles hosts periodically:
   ● The pipeline is as slow as the slowest host to recover state.
4. Increasing task parallelism to the rescue: the more threads, the faster we can download the state and rebuild the local state DB!
   ● Increasing parallelism requires state redistribution.
   ● Flink uses the concept of a key group as the atomic unit of state distribution.
Parallelism Increase from 180 to 270 doesn't work
With 180 tasks originally holding 2 key groups each, there are 360 key groups (i.e., max parallelism is 360). At parallelism 270, the key groups no longer divide evenly:
1. 180 tasks have 1 key group
2. 90 tasks have 2 key groups
The doubly loaded tasks become stragglers.

What we want is even distribution
Instead of 180 tasks with 2 key groups each, run 360 tasks with 1 key group each (see the sketch below).
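The rule in code: pick a parallelism that divides the max parallelism, so every task gets the same number of key groups. A minimal sketch with the numbers from these slides:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyGroupSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Number of key groups; fixed for the lifetime of the state.
        env.setMaxParallelism(360);
        // 180 divides 360: every task gets exactly 2 key groups.
        // 270 does NOT: 90 tasks would get 2 key groups and 180 would get 1.
        // 360 is perfectly even: 1 key group per task.
        env.setParallelism(180);
    }
}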
Monitoring Large State Size
● Flink can report native RocksDB metrics (example configuration below).
● State backend latency tracking metrics can help with debugging.
● A large backlog of pending RocksDB compactions can affect performance.
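A sketch of turning these on programmatically, using configuration keys from Flink's documentation (exact key availability depends on the Flink version):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateMetricsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Surface native RocksDB gauges, including the compaction backlog.
        conf.setString("state.backend.rocksdb.metrics.num-running-compactions", "true");
        conf.setString("state.backend.rocksdb.metrics.estimate-pending-compaction-bytes", "true");
        // Track state access latencies on the keyed state backend.
        conf.setString("state.backend.latency-track.keyed-state-enabled", "true");
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}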
Linux OOM Kills Causing Job Restarts
● Flink < 1.12 uses glibc to allocate memory, which leads to memory fragmentation.
● Combined with the large state required by the deduplicator app, this consistently caused OOM kills.
● With a large number of task managers and the time it takes to rehydrate state, this impacted the latency SLO.
jemalloc Everywhere
● Flink switched to jemalloc as the default memory allocator in its Docker images in Flink 1.12.
(memory usage graphs: pre-jemalloc vs. post-jemalloc)
Data Quality Monitoring
● Pinot is an analytics platform that runs SQL blazingly fast, so…
  ○ Duplicate detection (wired into a monitor in the sketch after this list):
    ■ SELECT primary_key, COUNT(*) AS cnt FROM mytable GROUP BY primary_key HAVING cnt > 1
    ■ Run the query against real-time data only to help query performance, using the special table name mytable_REALTIME
  ○ Missing-entry detection:
    ■ Bucket rows by time and count per bucket
    ■ JOIN/compare against the source of truth (an upstream metric in the Data Warehouse)
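A hedged sketch of running the duplicate-detection query as a standalone monitor with the Pinot Java client (pinot-java-client); the broker address and table name are placeholders, and alerting is left as a print statement:

import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSet;
import org.apache.pinot.client.ResultSetGroup;

public class DuplicateMonitor {
    public static void main(String[] args) {
        // Query the broker; the _REALTIME suffix keeps the scan cheap.
        Connection conn = ConnectionFactory.fromHostList("pinot-broker:8099");
        ResultSetGroup result = conn.execute(
                "SELECT primary_key, COUNT(*) AS cnt FROM mytable_REALTIME "
              + "GROUP BY primary_key HAVING cnt > 1 LIMIT 100");
        ResultSet rows = result.getResultSet(0);
        for (int i = 0; i < rows.getRowCount(); i++) {
            // Any row here is a duplicate primary key: page someone.
            System.out.printf("duplicate key %s seen %s times%n",
                    rows.getString(i, 0), rows.getString(i, 1));
        }
    }
}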
How to repair data in Pinot?
● If some range of data is corrupted (contains duplicates):
  ○ Find the duplicated data with a SQL query.
  ○ Delete and rebuild the Pinot segments containing the duplicates.
  ○ Pinot virtual columns like $segmentName help locate the affected segments.
● Best practices:
  ○ A reliable, exactly-once Kafka archive (backup) will come in handy in a fire.
  ○ Build a stable/reliable timestamp into the primary key, and use that timestamp as the Pinot timestamp.
Lessons Learned
● Flink
  ○ Set a Kafka transaction timeout large enough to account for any job downtime.
  ○ Set parallelism to a number that divides the max parallelism evenly.
  ○ Use jemalloc in Flink.
● Pinot
  ○ Higher Kafka transaction frequency and shorter Flink checkpoint intervals improve end-to-end data freshness in Pinot.
  ○ Beware of bogus message counts: many Kafka internal metrics include messages from failed transactions.
  ○ Duplicate monitoring is a must for critical apps.

