Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture

Flink Forward San Francisco 2022.

Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.

by
Jeff Chao

  2. An API that gets out of your way. It’s so easy, we’ve embedded a bunch of examples right here. Copy some of these requests into your terminal and check out what happens. With wrappers in Ruby, PHP, Python and more, you can get started in minutes. Learn More ➤
  3. As complexity grew… Started out with: “We had a problem, so we thought to use …” Then we had a ProblemFactory.
  4. As data volume grew… Database scalability is a complicated topic… Started out with: “Had to make sure it was web scale.” Distributed transactions. Change Data Capture.
  5. Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Capture. Flink Forward - San Francisco 2022. Jeff Chao, Staff Engineer / Tech Lead for Change Data Capture Infrastructure at Stripe.
  6. Agenda: (1) CDC at Stripe, (2) Aggregating Change Events, (3) How it Started, How it Ended.
      Change Data Capture (CDC) is widely used at Stripe to capture data changes from databases without critically impacting database reliability and scalability. CDC powers many critical financial use cases at Stripe such as the Stripe Dashboard, Stripe Search, Sigma, and Financial Reporting.
      From idea to production: things may seem straightforward at first, but the details matter. We detail our journey of how we leveraged Flink for Change Data Capture at Stripe in order to uphold the highest data quality standards.
      Freshness, Coverage, and Correctness SLOs are paramount to the success of platforms and applications running on top of our CDC infrastructure.
      Change Event Streams are ubiquitous across Stripe given the vast number of applications and employees generating datasets worldwide.
      Change Event Streams are independent from one another, which leads to the typical challenges in distributed systems.
      One of the major use cases revolves around aggregating individual change events of a database transaction to support Stripe’s payments infrastructure.
  7. Agenda: (1) CDC at Stripe, (2) Aggregating Change Events, (3) How it Started, How it Ended.
  8. Billing, Capital, Checkout, Connect, Invoicing, Corporate Card, Climate, Atlas, Radar, Sigma, Payouts, Payments, Terminal, Treasury, Issuing, Revenue Recognition, Payment Links, Tax, Identity, Elements, Data Pipeline, Financial Connections.
  9. CDC at Stripe: 30% Remote, 23 Countries, > 8000 Employees.
  10. CDC at Stripe: Strict SLOs. Correctness, Freshness, Coverage.
  11. CDC at Stripe: Building a Platform.
      ● Abstract Away Internals: Make sure that we abstract away database internals such as sharding topology and ensure a datastore-agnostic transport.
      ● Interoperable: Build a high-leverage platform which makes working with Change Events interoperable with other systems within the organization.
      ● Operational Excellence: Minimal toil as we scale the number of datasets; ensure clean separation between infrastructure and user issues; create great operator experiences; reduce control plane and data plane blast radius; maintain good operator tooling, developer experience, and processes.
  12. Agenda: (1) CDC at Stripe, (2) Aggregating Change Events, (3) How it Started, How it Ended.
  13. Aggregating Change Events: Why?
      ● Product teams working with payments data use transactions
      ● Arbitrary number of tables in a database transaction
      ● They should be able to get transactions back out from the CDC path
      ● They shouldn’t have to become stream processing experts
  14. Architecture (diagram): Vitess, Mongo, Debezium, Kafka, Flink, Kafka. Diagram labels: Platform, Platform, User.
  15. What is a Change Event?

          {
            "ts_utc": 1659375300000,
            "attributes": { ... },
            "data": [
              {
                "operation": "CREATE",
                "source": { ... },
                "transaction": { ... },
                "key": "some-unique-constraint",
                "before": null,
                "after": { ... },
                "attributes": { ... }
              }
            ]
          }
  16. What is a Change Event? The same event as on slide 15, shown as an element of Stream: charges.
  17. What is a Change Event? The same event as on slide 15, with the "transaction" field of a "data" entry expanded:

          {
            "id": "transaction-id",
            "global_position": 1,
            "source_position": 1
          }
  23. Change Events Can Come From Anywhere

          Stream: charges    { "data": [ {"source": { ... }} ] }
          Stream: audits     { "data": [ {"source": { ... }} ] }
          Stream: disputes   { "data": [ {"source": { ... }} ] }

  24. Databases Have Transactions

          BEGIN
          INSERT INTO charges
          UPDATE audits ...
          COMMIT
  25. What is a Transaction Metadata Event?

          // BEGIN Marker
          {
            "id": "transaction-id",
            "ts_utc": 1659375300000,
            "marker": "BEGIN",
            "total_events": null,
            "per_source_event_counts": null
          }

          // COMMIT Marker
          {
            "id": "transaction-id",
            "ts_utc": 1659375300000,
            "marker": "COMMIT",
            "total_events": 3,
            "per_source_event_counts": [{ ... }]
          }
  27. What is a Transaction Metadata Event? The same BEGIN and COMMIT markers as on slide 25, with "per_source_event_counts" expanded:

          [
            { "source": "keyspace.table1", "total_events": 1 },
            { "source": "keyspace.table2", "total_events": 1 }
          ]
  28. Putting It All Together

          -- 4 events
          BEGIN                 -- Transaction Metadata Event
          INSERT INTO charges   -- Change Event
          UPDATE audits ...     -- Change Event
          COMMIT                -- Transaction Metadata Event
  29. What is an Aggregated Change Event?

          {
            "ts_utc": 1659375300000,
            "data": [
              {
                "operation": "CREATE",
                "transaction": { "id": "txn1" },
                "before": null,
                "after": { ... }
              },
              {
                "operation": "UPDATE",
                "transaction": { "id": "txn1" },
                "before": { ... },
                "after": { ... }
              }
            ]
          }
  30. What is an Aggregated Change Event? Notes on the event from slide 29:
      ● One transaction with two events having the same transaction ID.
      ● Events may arrive from an arbitrary number of tables.
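
The grouping described on slides 29 and 30 can be sketched in plain Python. This is only an illustration, not Stripe's implementation: it assumes each "data" entry carries "transaction.id" and "transaction.global_position" as on slide 17, and it assumes the aggregate's "ts_utc" is the latest timestamp among its members, which the slides do not state.

```python
from collections import defaultdict


def aggregate_by_transaction(change_events):
    """Merge change-event envelopes into one aggregate per transaction ID."""
    grouped = defaultdict(lambda: {"ts_utc": 0, "data": []})
    for envelope in change_events:
        for record in envelope["data"]:
            aggregate = grouped[record["transaction"]["id"]]
            aggregate["data"].append(record)
            # Assumption: the aggregate takes the latest member timestamp.
            aggregate["ts_utc"] = max(aggregate["ts_utc"], envelope["ts_utc"])
    for aggregate in grouped.values():
        # Restore the order of events within the transaction.
        aggregate["data"].sort(key=lambda r: r["transaction"]["global_position"])
    return dict(grouped)


# Events from two tables arrive out of order but share transaction "txn1".
events = [
    {"ts_utc": 1659375300000, "data": [
        {"operation": "UPDATE", "transaction": {"id": "txn1", "global_position": 2}}]},
    {"ts_utc": 1659375300000, "data": [
        {"operation": "CREATE", "transaction": {"id": "txn1", "global_position": 1}}]},
]
aggregates = aggregate_by_transaction(events)
print([r["operation"] for r in aggregates["txn1"]["data"]])
```

A batch function like this ignores the hard part the rest of the talk covers: in a stream, the job has to decide when a transaction's group is complete.
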
  31. Flink Job Graph: Transaction Metadata Event Stream (one) and Change Event Stream (many; one per table) → Flat map → Windowed Aggregation → Aggregated Change Event Stream, plus a Side Output.
  32. Multiple Sources: Union, Join, Connect.
  33. Join: Joins elements of the same key within the same window.
      ● Produces pairwise elements
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
      Output: (Event 1, BEGIN), (Event 1, COMMIT), (Event 2, BEGIN), (Event 2, COMMIT), (Event 3, BEGIN), (Event 3, COMMIT).
  34. Union: Unions multiple streams of the same type into a single stream.
      ● Requires streams of the same type
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
      Output: none; won’t compile because the streams are of different types.
  35. Connect: Unions multiple streams, potentially of different types.
      ● Similar to Unions
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
      Output: Event 1, BEGIN, Event 2, COMMIT, Event 3, BEGIN, COMMIT.
  36. What Do We Need?
      ● Support for streams of different types
      ● Support for flexible stream combination semantics
      ● Don’t need pairwise outputs
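
The difference between join-style and connect-style output on slides 33 to 35 can be illustrated with a small Python model. It is a simplification, not Flink code: `window_join` and `connect` are hypothetical helper names, all elements are assumed to share one key and one window, and arrival interleaving is ignored.

```python
from itertools import product

change_events = ["Event 1", "Event 2", "Event 3"]
metadata_events = ["BEGIN", "COMMIT"]


def window_join(left, right):
    """Join-style output: every left element paired with every right element."""
    return list(product(left, right))


def connect(left, right):
    """Connect-style output: each element appears once, tagged with its side."""
    return [("left", e) for e in left] + [("right", e) for e in right]


pairs = window_join(change_events, metadata_events)
tagged = connect(metadata_events, change_events)
print(len(pairs), "pairwise outputs vs", len(tagged), "single elements")
```

The join multiplies elements into pairs that the aggregation would have to undo, which is why the slide lists "Don’t need pairwise outputs" as a requirement.
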
  37. Flink Job Definition

          val mainStream = transactionMetadataEventStream // uid and name omitted.
            .connect(changeEventStream)                   // Union different types.
  39. Connected Streams: Custom, Either.
  40. Either: Wraps an event containing one of two types, either from left or right stream.
      ● Out-of-box
      ● No concept of keys
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
      Output: one wrapper with Either.left set and Either.right = null, then one with Either.left = null and Either.right set, …
  41. Custom: Wraps an event containing one of two types, either from left or right stream, and a common key among both events.
      ● Small and simple code addition
      ● Need to extract keys
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
      Output: one wrapper with WrappedEvent.key = txn-1, WrappedEvent.left = null, and WrappedEvent.right set; then one with WrappedEvent.key = txn-1, WrappedEvent.left set, and WrappedEvent.right = null; …
  42. What Do We Need?
      ● Wrap elements of a connected stream
      ● Be able to identify keys to support aggregations later
  43. Flink Job Definition

          val mainStream = transactionMetadataEventStream // uid and name omitted.
            .connect(changeEventStream)                   // Union different types.
            .flatMap(new WrappedEventFunction)            // Like Either type, but with extra fields.
            .keyBy(_.key)                                 // Group events with the same transaction ID.
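
A rough Python model of the wrapper from slide 41 and the `keyBy` step above. The slides do not show the body of `WrappedEventFunction`, so the helpers below (`wrap_metadata`, `wrap_change`, `key_by`) are hypothetical; the field names follow slides 17 and 25, with the metadata stream on the left because it is the stream that `connect` is called on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WrappedEvent:
    key: str                      # transaction ID common to both event types
    left: Optional[dict] = None   # transaction metadata event, if present
    right: Optional[dict] = None  # change event record, if present


def wrap_metadata(event: dict) -> WrappedEvent:
    """Wrap a transaction metadata event; its own ID is the key."""
    return WrappedEvent(key=event["id"], left=event)


def wrap_change(record: dict) -> WrappedEvent:
    """Wrap a change event record; the key comes from its transaction field."""
    return WrappedEvent(key=record["transaction"]["id"], right=record)


def key_by(wrapped_events):
    """Group wrapped events by key, as keyBy(_.key) does logically."""
    groups = {}
    for wrapped in wrapped_events:
        groups.setdefault(wrapped.key, []).append(wrapped)
    return groups


groups = key_by([
    wrap_metadata({"id": "txn-1", "marker": "BEGIN"}),
    wrap_change({"operation": "CREATE",
                 "transaction": {"id": "txn-1", "global_position": 1}}),
])
print({key: len(members) for key, members in groups.items()})
```

Carrying the key on the wrapper is what the plain Either type lacks: both event kinds can now be routed to the same keyed state without inspecting which side is populated.
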
  45. Aggregation Characteristics
      ● Arbitrary number of Change Event Streams
      ● One Transaction Metadata Event Stream
      ● Change Events must have the same transaction IDs
      ● Handle late arriving or duplicate Change Events and Transaction Metadata Events
      ● Don’t result in infinite state growth
  46. Windowing: Session, Sliding, Tumbling.
  47. Tumbling Windows: Assigns elements to windows of a fixed size.
      ● Windows don’t overlap
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
  52. Tumbling Windows: Assigns elements to windows of a fixed size.
      ● Windows don’t overlap
      ● Late-arriving events? Add delay.
      ● Large delay? Trade-off: Freshness vs Correctness.
      ● Not quite right…
      Timeline (diagram): Change Events: Event 1, Event 2; Transaction Metadata Events: BEGIN, COMMIT.
  53. Sliding Windows: Assigns elements to windows of a fixed size, but with a slide interval.
      ● Almost like a tumbling window, but with windows overlapping
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
  54. Sliding Windows: Assigns elements to windows of a fixed size, but with a slide interval.
      ● Almost like a tumbling window, but with windows overlapping
      ● Late-arriving events? Same as tumbling windows.
      ● Slide interval? Explosion of windows
      ● Not quite right…
      Timeline (diagram): Change Events: Event 1, Event 2; Transaction Metadata Events: BEGIN, COMMIT.
  55. Session Windows: Groups elements that are seen relatively close to each other into the same window.
      ● Arbitrarily-sized windows; no fixed start and end
      ● Windows don’t overlap
      ● Windows close based on a defined gap of inactivity
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
  62. Session Windows: Groups elements that are seen relatively close to each other into the same window.
      ● Arbitrarily-sized windows; no fixed start and end
      ● Windows don’t overlap
      ● Windows close based on a defined gap of inactivity
      ● Session gap too small? Incomplete aggregates
      ● Session gap too big? Trade-off: Freshness vs Correctness
      ● Not quite right…
      Timeline (diagram): Change Events: Event 1, Event 2; Transaction Metadata Events: BEGIN, COMMIT.
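
The session-gap trade-off can be seen in a few lines of Python. This is a sketch of the windowing rule only (two elements share a session when they are less than `gap` apart), with made-up timestamps in milliseconds; it is not how Flink implements session windows internally.

```python
def session_windows(timestamps, gap):
    """Split event timestamps into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the gap: same session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions


# Hypothetical arrival times for one transaction: BEGIN, Event 1, Event 2, COMMIT.
arrivals = [0, 40, 250, 260]

print(session_windows(arrivals, gap=100))  # transaction split across two sessions
print(session_windows(arrivals, gap=500))  # one session, but it closes 500 ms late
```

With the small gap the transaction is split in two, so each window emits an incomplete aggregate; with the large gap the aggregate is complete but every result waits out the full gap, which costs freshness.
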
  63. Global Windows: Assigns elements to a single window.
      ● Only a single window per key
      ● Window never closes
      Timeline (diagram): Change Events: Event 1, Event 2, Event 3; Transaction Metadata Events: BEGIN, COMMIT, BEGIN, COMMIT.
  64. Global Windows: Assigns elements to a single window.
      ● Only a single window per key
      ● Window never closes
      ● Outputs never get evaluated and materialized
      ● Needs more…
  65. Global Windows + Custom Stateful Trigger: Assign elements to a Global Window and add a custom stateful trigger.
      ● Flexibly define open/close conditions for non-overlapping windows
      ● Reasonably handle late-arriving events
      ● Avoid infinite state growth and reduce likelihood of incomplete aggregates
  66. What Makes an Aggregation Complete?
      ● BEGIN transaction marker seen
      ● COMMIT transaction marker seen
      ● All Change Events of the transaction seen
      ● All Change Events are globally and locally ordered
  67. Custom Stateful Trigger: TransactionBoundaryTrigger

          if transaction metadata event:
              if begin transaction marker:
                  update begin marker state
              else:
                  update commit marker state
                  update bitmap state using commit marker’s total event count
              set timeout state and register event time timer
          else:
              update bitmap state with change event’s global position
              set timeout state and register event time timer

          if should trigger(begin, commit, total events):
              clear window
              TriggerResult.FIRE_AND_PURGE
          else:
              TriggerResult.CONTINUE

      Reference:

          // ChangeEvent#transaction
          {
            "id": "transaction-id",
            "global_position": 1,
            "source_position": 1
          }

          // TransactionMetadataEvent
          {
            "id": "transaction-id",
            "ts_utc": 1659375300000,
            "marker": "COMMIT",
            "total_events": 3,
            "per_source_event_counts": [{ ... }]
          }
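
The pseudocode on slide 67 can be turned into a small runnable model. This is a sketch under stated assumptions, not the actual TransactionBoundaryTrigger: it assumes "total_events" counts only change events, that "global_position" runs from 1 to "total_events" within a transaction, and it replaces the bitmap state with a Python set. The timeout state and event time timers are left out.

```python
FIRE_AND_PURGE = "FIRE_AND_PURGE"
CONTINUE = "CONTINUE"


class TransactionBoundaryState:
    """State kept per transaction ID (one key in the Global Window)."""

    def __init__(self):
        self.begin_seen = False
        self.commit_seen = False
        self.total_events = None
        self.positions = set()  # stands in for the bitmap state

    def is_complete(self):
        """True once both markers and every expected change event were seen."""
        if not (self.begin_seen and self.commit_seen):
            return False
        # Assumption: positions are 1-based and contiguous.
        return self.positions == set(range(1, self.total_events + 1))


def on_element(state, event):
    """Update state for one element and decide whether the window fires."""
    if "marker" in event:  # transaction metadata event
        if event["marker"] == "BEGIN":
            state.begin_seen = True
        else:
            state.commit_seen = True
            state.total_events = event["total_events"]
    else:  # change event; a duplicate position leaves the set unchanged
        state.positions.add(event["transaction"]["global_position"])
    return FIRE_AND_PURGE if state.is_complete() else CONTINUE


state = TransactionBoundaryState()
stream = [
    {"transaction": {"id": "txn-1", "global_position": 2}},  # arrives early
    {"id": "txn-1", "marker": "BEGIN", "total_events": None},
    {"id": "txn-1", "marker": "COMMIT", "total_events": 2},
    {"transaction": {"id": "txn-1", "global_position": 1}},  # arrives late
]
print([on_element(state, event) for event in stream])
```

Because completeness is decided from state rather than from arrival order, the window fires only when the last missing piece arrives, whichever stream it comes from.
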
  68. Flink Job Definition

          val mainStream = transactionMetadataEventStream // uid and name omitted.
            .connect(changeEventStream)                   // Union different types.
            .flatMap(new WrappedEventFunction)            // Like Either type, but with extra fields.
            .keyBy(_.key)                                 // Group events with the same transaction ID.
            .window(GlobalWindows.create)
            .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
            .process(new KeyedProcessor(...))
  70. Flink Job Definition

          val mainStream = transactionMetadataEventStream // uid and name omitted.
            .connect(changeEventStream)                   // Union different types.
            .flatMap(new WrappedEventFunction)            // Like Either type, but with extra fields.
            .keyBy(_.key)                                 // Group events with the same transaction ID.
            .window(GlobalWindows.create)
            .trigger(new TransactionBoundaryTrigger(...)) // Flexible windowing semantics.
            .process(new KeyedProcessor(...))

          mainStream // Side output to DLQ.
            .getSideOutput(...)
            .addSink(...)

          mainStream // Output aggregated change events.
            .addSink(...)
  71. Agenda: (1) CDC at Stripe, (2) Aggregating Change Events, (3) How it Started, How it Ended.
  72. How it Started, How it Ended: From Idea to Production. Coverage, Platform, State.
  73. State
  74. How It Started
  75. How It Started / How It Ended
  76. Observations
      ● Infinite keys due to continuous stream of new transactions
      ● Using a Global Window; possible windows not closing properly
      ● No trigger timeouts firing
      ● No watermarks being generated
  77. Observations: Idle Sub Tasks. Diagram: Source Sub Tasks for charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events.
  78. Fix
      ● Fixed an upstream issue where transaction IDs were getting mixed up
      ● Reduce parallelism on Source Sub Tasks for all streams
      ● Make sure parallelism ≤ ∑ Topic Partitions
      ● Generally, check with SplitEnumerator classes
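
Why parallelism above the partition count leaves subtasks idle can be shown with a toy assignment. The round-robin rule below is an assumption made for illustration; the real assignment is decided by the source's SplitEnumerator, as the slide notes, and `assign_partitions` and `idle_subtasks` are hypothetical helpers.

```python
def assign_partitions(partitions_per_topic, parallelism):
    """Spread topic partitions over source subtasks round-robin (illustrative only)."""
    assignment = {subtask: [] for subtask in range(parallelism)}
    index = 0
    for topic, count in partitions_per_topic.items():
        for partition in range(count):
            assignment[index % parallelism].append((topic, partition))
            index += 1
    return assignment


def idle_subtasks(assignment):
    """Subtasks that received no partition and therefore never read an event."""
    return [subtask for subtask, parts in assignment.items() if not parts]


# audits has one partition; a source parallelism of 2 leaves one subtask with nothing to read.
print(idle_subtasks(assign_partitions({"audits": 1}, parallelism=2)))
print(idle_subtasks(assign_partitions({"charges": 2}, parallelism=2)))
```

A subtask that never reads an event never emits a watermark, which matches the "No watermarks being generated" and "No trigger timeouts firing" observations on slide 76.
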
  79. How It Started
  80. How It Started / How It Ended
  81. Observations
      ● State size still growing, but slower
      ● Event time timers firing, sometimes
      ● Watermarks are being generated, but not for all sub tasks
  82. New Observations: Low volume stream. Diagram: Source Sub Tasks for charges (partitions = 2), audits (partitions = 1), disputes (partitions = 1), and Transaction Metadata Events.
  83. Possible Fix
      ● Switch from event time to processing time
      ● Less precise
      ● Could cause premature trigger firing, resulting in incomplete aggregates
  84. Actual Fix
      ● Add idleness property on sources
      ● Can still use event time
      ● More precise
      ● Not perfect; can still result in incomplete aggregates in edge cases
      ● That’s the reality of streaming
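
The effect of the idleness property can be modeled in Python. The sketch assumes the usual rule that a downstream watermark is the minimum over its inputs and that inputs marked idle are skipped; `combined_watermark` and `is_idle` are hypothetical helpers, and the numbers are invented.

```python
def is_idle(last_seen_at, now, idle_timeout):
    """An input counts as idle once it has been silent for the whole timeout."""
    return now - last_seen_at >= idle_timeout


def combined_watermark(inputs):
    """Minimum watermark over non-idle inputs; None if every input is idle."""
    active = [watermark for watermark, idle in inputs.values() if not idle]
    return min(active) if active else None


# A low-volume stream (audits) holds the watermark back at 200.
inputs = {"charges": (1000, False), "audits": (200, False)}
print(combined_watermark(inputs))

# Once audits is marked idle, the watermark follows the active stream.
inputs["audits"] = (200, is_idle(last_seen_at=0, now=60, idle_timeout=30))
print(combined_watermark(inputs))
```

The edge case from the slide is visible here too: if the idle stream later delivers an event older than the advanced watermark, that event is late and its transaction may already have timed out as an incomplete aggregate.
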
  85. Platform
  86. How It Started
  87. How It Started / How It Ended
  88. Observations
      ● Don’t want to redeploy every time a new dataset (Kafka Topic) is added
      ● Blows away Freshness SLO’s error budget
      ● Poor developer onboarding experience
  89. Fix
      ● Instead of Kafka Topic List Subscriber, use Regex Subscriber
      ● Subscribe to all topics (for a keyspace) by default
      ● Control plane (external) service produces an event to Broadcast Stream
      ● On broadcast element, use Broadcast State to keep onboarded datasets in state
      ● On element, check Broadcast State and filter for onboarded datasets
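
The broadcast pattern in the last three bullets can be sketched as a small Python class. The two method names mirror the Flink callbacks mentioned in the editor's notes (processElement, processBroadcastElement); the control event shape ("action", "topic") is invented for this example, and a plain set stands in for Broadcast State.

```python
class OnboardedDatasetFilter:
    """Keeps the set of onboarded topics and drops records from all others."""

    def __init__(self):
        self.onboarded = set()  # stands in for Flink Broadcast State

    def process_broadcast_element(self, control_event):
        """Control plane events update which datasets are onboarded."""
        if control_event["action"] == "onboard":
            self.onboarded.add(control_event["topic"])
        elif control_event["action"] == "offboard":
            self.onboarded.discard(control_event["topic"])

    def process_element(self, record):
        """Data records pass through only if their topic is onboarded."""
        return [record] if record["topic"] in self.onboarded else []


dataset_filter = OnboardedDatasetFilter()
record = {"topic": "keyspace.charges", "data": []}

print(dataset_filter.process_element(record))  # not onboarded yet: dropped
dataset_filter.process_broadcast_element({"action": "onboard", "topic": "keyspace.charges"})
print(dataset_filter.process_element(record))  # onboarded: passed through
```

Because the job already subscribes to every topic matching the pattern, onboarding a dataset becomes a control event rather than a redeploy.
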
  90. Coverage
  91. How It Started
  92. How It Started / How It Ended
  93. Observations
      ● Incomplete aggregates still happening, but not frequently
      ● Kafka by default is at-least-once delivery
      ● Many independent streams operating at different speeds
  94. Fix
      ● Move incomplete aggregate measurement out of the Flink Job and into a system downstream
      ● New system needs to dedupe events… for all time?
      ● Storage will be expensive. Trade-off between confidence and cost-efficiency: KV store or bloom filter
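
The KV store versus bloom filter trade-off can be made concrete with a small sketch. Nothing here reflects the downstream system Stripe built: the `BloomFilter` class, its sizing, and the "id" dedupe key are assumptions for illustration. An exact set plays the role of the KV store (full confidence, storage grows with every ID), while the bloom filter uses fixed memory but can occasionally report an unseen ID as seen.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedupe(events, seen):
    """Keep events whose ID was not seen; `seen` is a set (exact) or a BloomFilter."""
    unique = []
    for event in events:
        event_id = event["id"]  # hypothetical dedupe key
        if isinstance(seen, set):
            already_seen = event_id in seen
        else:
            already_seen = seen.might_contain(event_id)
        if not already_seen:
            seen.add(event_id)
            unique.append(event)
    return unique


events = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
print([e["id"] for e in dedupe(events, set())])
print(len(BloomFilter().bits), "bytes, regardless of how many IDs are added")
```

A false positive in the bloom filter would drop a genuine event, so the choice comes down to how much confidence the measurement needs against how much storage it can afford.
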
  95. How It Started
  96. How It Started / How It Ended
  97. Agenda: (1) CDC at Stripe, (2) Aggregating Change Events, (3) How it Started, How it Ended.
  98. Wrap Up
      ● Change Data Capture (CDC) is widely used at Stripe to improve database reliability and scalability
      ● Aggregating Change Events is relatively straightforward, but the details matter
      ● Flink is a critical component in Stripe’s CDC infrastructure that allows us to work with financial streaming data with high data quality guarantees
  99. Thank you!

Editor's Notes

  • What is Stripe? Who is it for?
  • At what scale? $640B in annual payment volume. Challenging…
  • Many products, many apps and services, many datasets.
  • Across many databases of different types. Mongo, MySQL. Multi-region, databases have many shards which are split as volume grows.
  • Watermarks are per partition, not per key. Perhaps note the upstream issue; nonetheless, it could have been surfaced by testing with late events.
  • Watermark = minimum across parallel inputs.
  • Keys can go to the same partition; one key could be late while another is not. The watermark will still progress and the timeout will fire, producing an incomplete aggregate. When the late key comes in, it is treated as an incomplete aggregate again.
  • Connect with broadcast stream.
    processElement -> check broadcast state
    processBroadcastElement -> update state
  • Union or join. Streams are independent and any one stream can have duplicates. A duplicate will result in an incomplete aggregate for that key, unless all streams have the same number of duplicates for that key, which is unlikely.
    Imagine an aggregate was just completed for a key. Then a duplicate arrives and the event sits in state until it times out.