Building High Fidelity Data Streams (QCon London 2023)

  1. Building High-Fidelity Data Streams
  2. Some Historical Context (Let’s go back to 2017)
  3. Some Historical Context Circa Dec 2017, I was tasked with building PayPal’s Change Data Capture (CDC) system
  4. Some Historical Context Circa Dec 2017, I was tasked with building PayPal’s Change Data Capture (CDC) system CDC provides a streaming alternative to repeatedly querying a database! It’s an ideal replacement for “polling”-style query workloads like the following!
  5. Some Historical Context select * from transaction where account_id=<x> and txn_date between <t1> and <t2>
  6. select * from transaction where account_id=<x> and txn_date between <t1> and <t2> PayPal Activity Feed A query like this powers the PayPal user activity feed! Some Historical Context
  7. select * from transaction where account_id=<x> and txn_date between <t1> and <t2> The query above will need to be run very frequently to meet users’ timeliness expectations! PayPal Activity Feed A query like this powers the PayPal user activity feed! Users expect to see a completed transaction in their feed as soon as they complete a money transfer, crypto purchase, or merchant checkout! Some Historical Context
  8. Some Historical Context As PayPal traffic grew, queries like this one became more expensive to run and presented a burden on DB capacity!
  9. Some Historical Context As PayPal traffic grew, queries like this one became more expensive to run and presented a burden on DB capacity!
  10. Some Historical Context There was a strong interest in offloading DB traffic to something like CDC. However, client teams wanted guarantees around non-functional requirements (NFRs) similar to what they had with DBs: 3 nines (99.9%) availability; sub-second data latency (P95); scalability; reliability, i.e. no data loss! No off-the-shelf solutions were available to meet these requirements!
  11. (…So, we built our own solution!) The Core Data Highway (Change Data Capture @ PayPal)
  12. Some Historical Context If building our own solution, we knew that we needed to make NFRs first-class citizens! e.g. scalability, availability, reliability/data quality, observability, etc…
  13. Some Historical Context If building our own solution, we knew that we needed to make NFRs first-class citizens! e.g. scalability, availability, reliability/data quality, observability, etc… In streaming systems, any gaps in NFRs can be unforgiving. When outages occur, resolving incidents comes with added time pressure, since any data lag is very visible and impacts customers!
  14. Some Historical Context If building our own solution, we knew that we needed to make NFRs first-class citizens! e.g. scalability, availability, reliability/data quality, observability, etc… In streaming systems, any gaps in NFRs can be unforgiving. When outages occur, resolving incidents comes with added time pressure, since any data lag is very visible and impacts customers! If you don’t build your systems with the -ilities as first-class citizens, you pay a steep operational tax.
  15. Core Data Highway • 70+ RAC primaries • 70k+ tables • 20-30 PBs of data
  16. Back to the more Recent Past (Circa 2020)
  17. Moving to the Cloud I joined Datazoom, a company specializing in video telemetry collection! Datazoom wanted similar streaming delivery guarantees as PayPal, but with a few key differences! PayPal: built on bare-metal in PP DCs, deep pockets, internal customers. Datazoom: built for the public cloud, on a shoestring (seed funded), with external customers.
  18. Building High Fidelity Data Streams! Now that we have some context… let’s build a high-fidelity streaming system from the ground up!
  19. Start Simple
  20. Start Simple Goal : Build a system that can deliver messages from source S to destination D S D
  21. Start Simple Goal : Build a system that can deliver messages from source S to destination D S D But first, let’s decouple S and D by putting messaging infrastructure between them E S D Events topic
  22. Start Simple Make a few more implementation decisions about this system E S D
  23. Start Simple Make a few more implementation decisions about this system E S D Run our system on a cloud platform (e.g. AWS)
  24. Start Simple Make a few more implementation decisions about this system Operate at low scale E S D Run our system on a cloud platform (e.g. AWS)
  25. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition E S D Run our system on a cloud platform (e.g. AWS)
  26. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2) E S D Run our system on a cloud platform (e.g. AWS)
  27. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2) Run S & D on single, separate EC2 Instances E S D Run our system on a cloud platform (e.g. AWS)
  28. Start Simple To make things a bit more interesting, let’s provide our stream as a service We define our system boundary using a blue box as shown below! E S D
  29. Reliability (Is This System Reliable?)
  30. Reliability Goal : Build a system that can deliver messages reliably from S to D E S D
  31. Reliability Goal : Build a system that can deliver messages reliably from S to D E S D Concrete Goal : 0 message loss
  32. Reliability Goal : Build a system that can deliver messages reliably from S to D E S D Concrete Goal : 0 message loss Once S has ACKd a message to a remote sender, D must deliver that message to a remote receiver
  33. Reliability How do we build reliability into our system? E S D
  34. Reliability Let’s first generalize our system! A B C m1 m1
  35. Reliability In order to make this system reliable A B C m1 m1
  36. Reliability A B C m1 m1 In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link
  37. Reliability A B C m1 m1 In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link Insight : If each process/link is transactional in nature, the chain will be transactional!
  38. Reliability A B C m1 m1 In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link Insight : If each process/link is transactional in nature, the chain will be transactional! Transactionality = at-least-once delivery
  39. Reliability A B C m1 m1 In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link Insight : If each process/link is transactional in nature, the chain will be transactional! Transactionality = at-least-once delivery How do we make each link transactional?
  40. Reliability Let’s first break this chain into its component processing links A m1 m1 B m1 m1 C m1 m1
  41. Reliability Let’s first break this chain into its component processing links A m1 m1 B m1 m1 C m1 m1 A is an ingest node
  42. Reliability Let’s first break this chain into its component processing links A m1 m1 B m1 m1 C m1 m1 B is an internal node
  43. Reliability Let’s first break this chain into its component processing links A m1 m1 B m1 m1 C m1 m1 C is an expel node
  44. Reliability But, how do we handle edge nodes A & C? A m1 m1 B m1 m1 C m1 m1 What does A need to do? • Receive a Request (e.g. REST) • Do some processing • Reliably send data to Kafka • kProducer.send(topic, message) • kProducer.flush() • Producer Config • acks = all • Send HTTP Response to caller
  45. Reliability But, how do we handle edge nodes A & C? A m1 m1 B m1 m1 C m1 m1 What does C need to do? • Read data (a batch) from Kafka • Do some processing • Reliably send data out • ACK / NACK Kafka • Consumer Config • enable.auto.commit = false • ACK moves the read checkpoint forward • NACK forces a reread of the same data
  46. Reliability But, how do we handle internal node B? A m1 m1 B m1 m1 C m1 m1 B is a combination of A and C
  47. Reliability But, how do we handle internal node B? A m1 m1 B m1 m1 C m1 m1 B is a combination of A and C B needs to act like a reliable Kafka Producer
  48. Reliability But, how do we handle internal node B? A m1 m1 B m1 m1 C m1 m1 B is a combination of A and C B needs to act like a reliable Kafka Producer B needs to act like a reliable Kafka Consumer
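The per-link contracts above can be sketched as a small simulation. This is a hedged illustration, not the talk's implementation: a real node A would call kProducer.send / kProducer.flush with acks=all, and a real node C would ACK/NACK its Kafka checkpoint; here the class names (Upstream, Link) are hypothetical and a plain list stands in for the durable log.

```python
# Minimal sketch of a transactional link (at-least-once delivery).
# The link ACKs upstream only after its durable write succeeds; a
# missing ACK causes a redelivery, so failures cause duplicates, not loss.

class Upstream:
    """Holds messages until the link ACKs them; no ACK means redelivery."""
    def __init__(self, messages):
        self.pending = list(messages)

    def deliver(self, link):
        while self.pending:
            msg = self.pending[0]
            if link.handle(msg):           # link ACKs after durable write
                self.pending.pop(0)        # ACK: advance the read checkpoint
            # no ACK: leave msg in place -> it is re-read (at-least-once)

class Link:
    """A processing link: receive -> process -> durably forward -> ACK."""
    def __init__(self, fail_first_attempt=False):
        self.downstream = []
        self._failed_once = not fail_first_attempt

    def handle(self, msg):
        processed = msg.upper()            # "do some processing"
        if not self._failed_once:          # simulate a crash before the ACK
            self._failed_once = True
            return False                   # no ACK -> upstream redelivers
        self.downstream.append(processed)  # durable send (stands in for Kafka)
        return True                        # ACK to upstream

up = Upstream(["m1", "m2"])
link = Link(fail_first_attempt=True)
up.deliver(link)
print(link.downstream)   # ['M1', 'M2'] -- nothing lost despite the failure
```

Chaining several such links gives the transactional chain described above: each hop either delivers and ACKs, or forces a re-read.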
  49. Reliability How reliable is our system now? A B C m1 m1
  50. Reliability How reliable is our system now? What happens if a process crashes? A B C m1 m1
  51. Reliability How reliable is our system now? What happens if a process crashes? If A crashes, we will have a complete outage at ingestion! A B C m1 m1
  52. Reliability How reliable is our system now? What happens if a process crashes? If A crashes, we will have a complete outage at ingestion! If C crashes, we will stop delivering messages to external consumers! A B C m1 m1
  53. Reliability Solution : Place each service in an autoscaling group of size T, which tolerates T-1 concurrent failures A B C m1 m1
  54. Reliability Solution : Place each service in an autoscaling group of size T, which tolerates T-1 concurrent failures For now, we appear to have a pretty reliable data stream A B C m1 m1
  55. But how do we measure its reliability?
  56. (This brings us to …) Observability (A story about Lag & Loss Metrics)
  57. Lag : What is it?
  58. Lag : What is it? Lag is simply a measure of message delay in a system
  59. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag
  60. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag The greater the lag, the greater the impact to the business
  61. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag The greater the lag, the greater the impact to the business Hence, our goal is to minimize lag in order to deliver insights as quickly as possible
  62. Lag : How do we compute it?
  63. Lag : How do we compute it? eventTime : the creation time of an event message Lag can be calculated for any message m1 at any node N in the system as lag(m1, N) = current_time(m1, N) - eventTime(m1) A B C m1 m1 eventTime: T0
  64. Lag : How do we compute it? A B C eventTime: T0 = 12p m1
  65. Lag : How do we compute it? A B C T1 = 12:01p eventTime: T0 = 12p m1
  66. Lag : How do we compute it? A B C T1 = 12:01p T3 = 12:04p eventTime: T0 = 12p m1
  67. Lag : How do we compute it? A B C T1 = 12:01p T3 = 12:04p T5 = 12:10p eventTime: T0 = 12p m1 m1
  68. Lag : How do we compute it? A B C T1 = 12:01p T3 = 12:04p T5 = 12:10p eventTime: T0 = 12p m1 m1 lag(m1, A) = T1-T0 lag(m1, B) = T3-T0 lag(m1, C) = T5-T0 lag(m1, A) = 1m lag(m1, B) = 4m lag(m1, C) = 10m
  69. Lag : How do we compute it? A B C Arrival Lag (Lag-in): time message arrives - eventTime T1 T3 T5 eventTime: T0 m1 m1 Lag-in @ A = T1 - T0 (e.g. 1 ms) B = T3 - T0 (e.g. 3 ms) C = T5 - T0 (e.g. 8 ms)
  70. Lag : How do we compute it? A B C Arrival Lag (Lag-in): time message arrives - eventTime T1 T3 T5 eventTime: T0 m1 m1 Lag-in @ A = T1 - T0 (e.g. 1 ms) B = T3 - T0 (e.g. 3 ms) C = T5 - T0 (e.g. 8 ms) Observation: Lag is Cumulative
  71. Lag : How do we compute it? Arrival Lag (Lag-in): time message arrives - eventTime Departure Lag (Lag-out): time message leaves - eventTime A B C T1 T3 T5 T2 T4 T6 eventTime: T0 m1 Lag-in @ A = T1 - T0 (e.g. 1 ms) B = T3 - T0 (e.g. 3 ms) C = T5 - T0 (e.g. 8 ms) Lag-out @ A = T2 - T0 (e.g. 2 ms) B = T4 - T0 (e.g. 4 ms) C = T6 - T0 (e.g. 10 ms)
  72. Lag : How do we compute it? Arrival Lag (Lag-in): time message arrives - eventTime Departure Lag (Lag-out): time message leaves - eventTime A B C T1 T3 T5 T2 T4 T6 eventTime: T0 m1 Lag-in @ A = T1 - T0 (e.g. 1 ms) B = T3 - T0 (e.g. 3 ms) C = T5 - T0 (e.g. 8 ms) Lag-out @ A = T2 - T0 (e.g. 2 ms) B = T4 - T0 (e.g. 4 ms) C = T6 - T0 (e.g. 10 ms) E2E Lag The most important metric for Lag in any streaming system is E2E Lag: the total time a message spent in the system
  73. Lag : How do we compute it? While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages
  74. Lag : How do we compute it? While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages Instead, we prefer statistics (e.g. P95) to capture population behavior
  75. Lag : How do we compute it? Some useful Lag statistics are: E2E Lag (p95) : 95th percentile time of messages spent in the system Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
  76. Lag : How do we compute it? Some useful Lag statistics are: E2E Lag (p95) : 95th percentile time of messages spent in the system Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N Process_Duration(N, p95) : Time Spent at any node in the chain! Lag_out(N, p95) - Lag_in(N, p95)
  77. Lag : How do we compute it? Process_Duration graphs show you the contribution to overall Lag from each hop m1 m1
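The lag statistics above reduce to simple timestamp arithmetic. A minimal sketch in plain Python (helper names are illustrative), using the example per-hop timings from the earlier slides (lag-in 1/3/8 ms, lag-out 2/4/10 ms):

```python
# Sketch of the lag metrics defined above. Timestamps are epoch millis;
# eventTime is T0, arrivals are T1/T3/T5, departures are T2/T4/T6.

def lag_in(arrival_time, event_time):
    return arrival_time - event_time       # Arrival Lag

def lag_out(departure_time, event_time):
    return departure_time - event_time     # Departure Lag

def process_duration(l_out, l_in):
    return l_out - l_in                    # time spent inside the node

t0 = 0                                     # eventTime
arrivals   = {"A": 1, "B": 3, "C": 8}      # T1, T3, T5
departures = {"A": 2, "B": 4, "C": 10}     # T2, T4, T6

lags_in   = {n: lag_in(t, t0)  for n, t in arrivals.items()}
lags_out  = {n: lag_out(t, t0) for n, t in departures.items()}
durations = {n: process_duration(lags_out[n], lags_in[n]) for n in arrivals}
e2e_lag   = lags_out["C"]                  # total time in the system

print(lags_in)    # {'A': 1, 'B': 3, 'C': 8}
print(durations)  # {'A': 1, 'B': 1, 'C': 2}
print(e2e_lag)    # 10
```

In production these per-message values would feed a percentile aggregator (e.g. P95 per node per minute) rather than being inspected individually.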
  78. Loss : What is it?
  79. Loss : What is it? Loss is simply a measure of messages lost while transiting the system
  80. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate!
  81. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate! The greater the loss, the lower the data quality
  82. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate! The greater the loss, the lower the data quality Hence, our goal is to minimize loss in order to deliver high quality insights
  83. Loss : How do we compute it?
  84. Loss : How do we compute it? Concepts : Loss can be computed as the set difference of messages between any 2 points in the system [diagram: the same messages observed at two points in the pipeline]
  85. Loss : How do we compute it? (per-message presence at each of the 4 pipeline hops)
  Message Id: m1 → 1 1 1 1 | m2 → 1 1 1 1 | m3 → 1 0 0 0 | … | m10 → 1 1 0 0
  Count: 10 9 7 5
  Per Node Loss(N): 0 1 2 2
  E2E Loss: 5 E2E Loss %: 50%
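The table above is just set arithmetic. A sketch with illustrative message-id sets chosen to match the slide's counts (10 ingested; 9, 7, 5 surviving at successive hops; 50% end-to-end loss):

```python
# Loss as a set difference of message ids between any two points in the
# system. The sets are illustrative but mirror the slide's table.

ids = {f"m{i}" for i in range(1, 11)}           # m1..m10 ingested (10)
seen = {
    "A": ids,                                   # ingest: 10
    "B": ids - {"m3"},                          # 9 (1 lost at B)
    "C": ids - {"m3", "m9", "m10"},             # 7 (2 more lost at C)
    "D": ids - {"m3", "m7", "m8", "m9", "m10"}, # 5 delivered (2 more at D)
}

e2e_lost = seen["A"] - seen["D"]                # set difference = E2E loss
e2e_loss_pct = 100 * len(e2e_lost) / len(seen["A"])
print(sorted(e2e_lost), e2e_loss_pct)
```

Per-node loss falls out the same way: `seen["B"] - seen["C"]` gives the ids lost at hop C, and so on down the chain.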
  86. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? [diagram: continuous stream of messages m1…m31]
  87. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? Solution Allocate messages to 1-minute wide time buckets using message eventTime [diagram: continuous stream of messages m1…m31]
  88. Loss : How do we compute it? (the same loss table, now scoped to the @12:34p bucket)
  Message Id: m1 → 1 1 1 1 | m2 → 1 1 1 1 | m3 → 1 0 0 0 | … | m10 → 1 1 0 0
  Count: 10 9 7 5
  Per Node Loss(N): 0 1 2 2
  E2E Loss: 5 E2E Loss %: 50%
  89. Loss : How do we compute it? [diagram: messages allocated to time buckets] @12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p
  90. Loss : How do we compute it? [diagram: messages allocated to time buckets] @12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p Now
  91. Loss : How do we compute it? [diagram: messages allocated to time buckets] @12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p Now Update late arrival data
  92. Loss : How do we compute it? [diagram: messages allocated to time buckets] @12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p Now Update late arrival data Compute Loss
  93. Loss : How do we compute it? [diagram: messages allocated to time buckets] @12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p Now Update late arrival data Compute Loss Age Out
  94. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? Solution Allocate messages to 1-minute wide time buckets using message eventTime Wait a few minutes for messages to transit, then compute loss (e.g. 12:35p loss table) Raise alarms if loss occurs over a configured threshold (e.g. > 1%) [diagram: continuous stream of messages m1…m31]
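The bucketing scheme above can be sketched directly: allocate ids to 1-minute buckets by eventTime, then compare ingest vs. delivered counts per bucket and alarm above a threshold. Function names are illustrative; the wait-for-late-arrivals and age-out steps are elided.

```python
# Sketch of time-bucketed loss counting for a never-ending stream.
from collections import defaultdict

BUCKET_MS = 60_000   # 1-minute buckets

def bucket(event_time_ms):
    return event_time_ms - (event_time_ms % BUCKET_MS)

def loss_by_bucket(ingested, delivered, threshold_pct=1.0):
    """ingested/delivered: lists of (message_id, event_time_ms).
    Returns {bucket_start_ms: loss_pct} for buckets over the threshold."""
    in_b, out_b = defaultdict(set), defaultdict(set)
    for mid, ts in ingested:
        in_b[bucket(ts)].add(mid)
    for mid, ts in delivered:
        out_b[bucket(ts)].add(mid)
    alarms = {}
    for b, mids in in_b.items():
        lost = mids - out_b[b]             # set difference per bucket
        pct = 100 * len(lost) / len(mids)
        if pct > threshold_pct:
            alarms[b] = pct
    return alarms

ingested  = [("m1", 0), ("m2", 10_000), ("m3", 61_000)]
delivered = [("m1", 0), ("m2", 10_000)]    # m3 never made it out
print(loss_by_bucket(ingested, delivered)) # {60000: 100.0}
```

Note that both sides are bucketed by eventTime, not arrival time, so a message always lands in the same bucket at every hop.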
  95. Loss : How do we compute it? We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system. But wait…
  96. Performance (have we tuned our system for performance yet??)
  97. Performance Goal : Build a system that can deliver messages reliably from S to D with low latency S S S S S S S D S S S … S S S … To understand streaming system performance, let’s understand the components of E2E Lag
  98. Performance Ingest Time S S S S S S S D S S S … S S S … Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
  99. Performance Ingest Time S S S S S S S D S S S … S S S … • This time includes overhead of reliably sending messages to Kafka Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
  100. Performance Ingest Time Expel Time S S S S S S S D S S S … S S S … Expel Time : Time to process and egest a message at D. This includes time to wait for ACKs from external services!
  101. Performance E2E Lag Ingest Time Expel Time Transit Time S S S S S S S D S S S … S S S … E2E Lag : Total time messages spend in the system from message ingest to expel!
  102. Performance Penalties (Trade-off: Latency for Reliability)
  103. Performance Challenge 1 : Ingest Penalty In the name of reliability, S needs to call kProducer.flush() on every inbound API request S also needs to wait for 3 ACKs from Kafka before sending its API response E2E Lag Ingest Time Expel Time Transit Time S S S S S S S D S S S … S S S …
  104. Performance Challenge 1 : Ingest Penalty Approach : Amortization Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty E2E Lag Ingest Time Expel Time Transit Time S S S S S S S D S S S … S S S …
  105. Performance E2E Lag Ingest Time Expel Time Transit Time Challenge 2 : Expel Penalty Observations Kafka is very fast — many orders of magnitude faster than HTTP RTTs The majority of the expel time is the HTTP RTT S S S S S S S D S S S … S S S …
  106. Performance E2E Lag Ingest Time Expel Time Transit Time Challenge 2 : Expel Penalty Approach : Amortization In each D node, add batch + parallelism S S S S S S S D S S S … S S S …
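The batch + parallelism approach at D can be sketched with a thread pool: amortize the per-message HTTP round trip by sending a whole consumed batch with bounded concurrency. `send_one` is a hypothetical stand-in for the real egest call.

```python
# Sketch of amortizing the expel penalty: instead of one blocking HTTP
# round trip per message, expel a batch with bounded parallelism.
from concurrent.futures import ThreadPoolExecutor

def send_one(msg):
    # a real D node would POST to the destination and wait for its ACK
    return f"ACK:{msg}"

def expel_batch(batch, parallelism=8):
    # pool.map preserves input order, which keeps ACK bookkeeping simple
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(send_one, batch))

print(expel_batch(["m1", "m2", "m3"]))
```

With N-way parallelism, the expel time for a batch approaches one HTTP RTT plus queueing, rather than batch_size RTTs in series.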
  107. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
  108. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures
  109. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures In contrast, we should never retry a message that has 0 chance of success! We call these Non-Recoverable Failures
  110. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures In contrast, we should never retry a message that has 0 chance of success! We call these Non-Recoverable Failures E.g. Any 4xx HTTP response code, except for 429 (Too Many Requests)
  111. Performance Challenge 3 : Retry Penalty Approach We pay a latency penalty on retry, so we need to be smart about What we retry — Don’t retry any non-recoverable failures How we retry
  112. Performance Challenge 3 : Retry Penalty Approach We pay a latency penalty on retry, so we need to be smart about What we retry — Don’t retry any non-recoverable failures How we retry — One Idea : Tiered Retries
  113. Performance - Tiered Retries Local Retries Try to send a message a configurable number of times @ D Global Retries
  114. Performance - Tiered Retries Local Retries Try to send a message a configurable number of times @ D If we exhaust local retries, D transfers the message to a Global Retrier Global Retries
  115. Performance - Tiered Retries Local Retries Try to send a message a configurable number of times @ D If we exhaust local retries, D transfers the message to a Global Retrier Global Retries The Global Retrier then retries the message over a longer span of time
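The tiered-retry logic can be sketched as follows. Names like `GLOBAL_RETRY_QUEUE` and `deliver` are illustrative stand-ins (the real global retrier sits behind the retry topics shown on the next slide); the failure classification follows the rule above: any 4xx is non-recoverable except 429.

```python
# Sketch of 2-tiered retries: classify the failure, retry recoverable
# failures a configurable number of times locally, then hand off to a
# global retrier that works over a longer time span.

GLOBAL_RETRY_QUEUE = []   # stands in for the global retry topic

def is_recoverable(status):
    # any 4xx is non-recoverable, except 429 (Too Many Requests)
    return status == 429 or not (400 <= status < 500)

def deliver(msg, send, local_attempts=3):
    for _ in range(local_attempts):
        status = send(msg)
        if status < 400:
            return "delivered"
        if not is_recoverable(status):
            return "dropped"               # retrying can never succeed
    GLOBAL_RETRY_QUEUE.append(msg)         # exhausted local retries
    return "handed off"

responses = iter([503, 503, 200])          # transient failures, then success
assert deliver("m1", lambda m: next(responses)) == "delivered"
assert deliver("m2", lambda m: 404) == "dropped"      # non-recoverable
assert deliver("m3", lambda m: 503) == "handed off"   # goes to global retrier
print(GLOBAL_RETRY_QUEUE)
```

Keeping local retries short bounds the latency penalty on the hot path, while the global tier preserves zero-loss semantics for slow-to-recover destinations.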
  116. Performance - 2 Tiered Retries RI : Retry_In RO : Retry_Out [diagram: E S … D pipeline with RO and RI topics feeding a global retrier gr]
  117. Performance At this point, we have a system that works well at low scale
  118. Scalability (But how does this system scale with increasing traffic?)
  119. Scalability Goal : Build a system that can deliver messages reliably from S to D with low latency up to a scale limit
  120. Scalability Goal : Build a system that can deliver messages reliably from S to D with low latency up to a scale limit What do I mean by a scale limit?
  121. Scalability First, let’s dispel a myth!
  122. Scalability First, let’s dispel a myth! There is no such thing as a system that can handle infinite scale
  123. Scalability First, let’s dispel a myth! There is no such thing as a system that can handle infinite scale Every system is capacity limited — in the case of AWS, some of these limits are artificial! Capacity limits & performance bottlenecks can only be revealed under load
  124. Scalability First, let’s dispel a myth! There is no such thing as a system that can handle infinite scale Every system is capacity limited — in the case of AWS, some of these limits are artificial! Capacity limits & performance bottlenecks can only be revealed under load Hence, we need to periodically load test our system at increasingly higher load to find and remove performance bottlenecks Each successful test raises our scale rating!
  125. Scalability First, let’s dispel a myth! There is no such thing as a system that can handle infinite scale Every system is capacity limited — in the case of AWS, some of these limits are artificial! Capacity limits & performance bottlenecks can only be revealed under load Hence, we need to periodically load test our system at increasingly higher load to find and remove performance bottlenecks Each successful test raises our scale rating! By executing these tests ahead of organic load, we avoid customer-impacting message delays!
  126. Scalability How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest
  127. Scalability Is it enough to only specify the target throughput? How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest
  128. Scalability How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest Is it enough to only specify the target throughput? No — most systems experience increasing latency with increasing traffic! A scale rating which violates our E2E lag SLA is not useful! So, a scale rating must be qualified by a latency condition to be useful!
  129. Scalability How do we set a target scale rating for our load tests? Desired Scale Rating: 1 MM events/second @ ingest Is it enough to only specify the target throughput? No — most systems experience increasing latency with increasing traffic! A scale rating which violates our E2E lag SLA is not useful! So, a scale rating must be qualified by a latency condition to be useful! Updated Scale Rating: 1 MM events/second with <1s (p95) E2E lag
  130. Scalability But, how do we select a scale rating? What should it be based on?
  131. Scalability But, how do we select a scale rating? What should it be based on? @ Datazoom, we target a scale rating that is a multiple m of our organic peak! We maintain traffic alarms that tell us when our organic traffic exceeds 1/m of our previous scale rating — this tells us that a new test needs to be scheduled! We spend the duration of our load tests diagnosing system performance and identifying bottlenecks that result in higher than expected latency! Once we resolve all bottlenecks, we update our traffic alarms for our new higher scale rating based on a new target multiple m’
  132. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost
  133. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost
  134. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost Autoscaling Considerations What can autoscale? What can’t autoscale?
  135. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost Autoscaling Considerations What can autoscale? What can’t autoscale? MSK Serverless aims to bridge this, but it’s expensive!
  136. Scalability - Autoscaling @ Datazoom, we use K8s — this gives us all of the benefits of containerization! But since we run on the cloud, K8s pods need to scale on top of EC2, which itself can be autoscaled! Hence, we have a 2-level autoscaling problem! EC2 EC2 EC2
  137. Scalability - HPA Journey K8s : HPA a.k.a. Horizontal Pod Autoscaler (CPU) EC2 : EC2 autoscaling (Memory) CloudWatch (CW) provides the scaling signal First Attempt : Independently Scale EC2 & K8s pods Nov 25, 2020 : Major Kinesis outage in US-East-1 EC2 EC2 EC2 HPA EC2 CW
  138. Scalability - HPA Journey Datazoom doesn’t directly use Kinesis, but CloudWatch does! Nov 25, 2020 : Major Kinesis outage in US-East-1 EC2 EC2 EC2 HPA EC2 CW EC2 EC2 EC2 HPA EC2 CW K8s scaled (in) to Min Pod counts, resulting in a gray failure (high lag & some message loss)
  139. Scalability - HPA Journey Switching to KEDA (the K8s Event-Driven Autoscaler) handles this situation We also switched to using the K8s Cluster Autoscaler to coordinate EC2 autoscaling with HPA! EC2 EC2 EC2 HPA EC2 CW KEDA
  140. Scalability At this point, we have a system that works well within a traffic or scale rating
  141. Availability (But, what can we say about its availability??)
  142. Availability What does Availability mean?
  143. Availability What does Availability mean? Availability measures the consistency with which a service meets the expectations of its users
  144. Availability What does Availability mean? Availability measures the consistency with which a service meets the expectations of its users For example, if I want to ssh into a machine, I want it to be running However, if its performance is degraded (e.g. low available memory), uptime is not enough! Any command I issue will take too long to complete
  145. Availability Gray Zone Down Up • 0 message loss • E2E Lag (p95)<= SLA (e.g. 1s) • Either message loss or the system is not accepting new messages • E2E Lag (p95)>>> SLA
  146. Availability Down Up • 0 message loss • E2E Lag (p95)<= SLA (e.g. 1s) • Either message loss or the system is not accepting new messages • E2E Lag (p95)> SLA
  147. Availability Down Up • 0 message loss • E2E Lag (p95)<= SLA (e.g. 1s) • Either message loss or the system is not accepting new messages • E2E Lag (p95)> SLA Let’s use this approach to de fi ne an Availability SLA for Data Streams!
  148. Availability Consider that … There are 1440 minutes in a day For a given minute (e.g. 12:04p), calculate the E2E lag (p95) for all events whose event_time falls in the 12:04p minute! If E2E lag must be < 1 second, compare this statistic to 1 second! Is E2E_Lag(p95) < 1 second? If yes, then this 1 minute period is within Lag SLAs If no, then this 1 minute period is out of Lag SLAs. In the parlance of the “nines” model, this would constitute a minute of downtime!
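The per-minute SLA check can be sketched as follows; the minute labels and the simple nearest-rank p95 are illustrative, not the talk's exact implementation.

```python
# Sketch of per-minute availability: for each minute, take the p95 of
# E2E lag (ms) over events whose event_time falls in that minute, and
# count the minute as downtime if it misses the SLA.

def p95(values):
    # nearest-rank percentile; fine for a sketch, not for sparse samples
    vals = sorted(values)
    idx = max(0, round(0.95 * len(vals)) - 1)
    return vals[idx]

def downtime_minutes(lags_by_minute, sla_ms=1000):
    return [m for m, lags in sorted(lags_by_minute.items())
            if p95(lags) >= sla_ms]

lags = {
    "12:03p": [200, 300, 400],    # comfortably within the 1s SLA
    "12:04p": [200, 300, 5000],   # p95 blown by a slow tail
}
print(downtime_minutes(lags))     # ['12:04p']
```

Summing such minutes per day and comparing against the allowance (1.44 minutes for three nines computed daily) gives the availability figure described below.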
  149. Availability You may be familiar with the “nines of availability” chart below: Availability % Downtime per year Downtime per quarter Downtime per month Downtime per week Downtime per day (24 hours) 99% ("two nines") 3.65 days 21.9 hours 7.31 hours 1.68 hours 14.40 minutes 99.5% ("two and a half nines") 1.83 days 10.98 hours 3.65 hours 50.40 minutes 7.20 minutes 99.9% ("three nines") 8.77 hours 2.19 hours 43.83 minutes 10.08 minutes 1.44 minutes 99.95% ("three and a half nines") 4.38 hours 65.7 minutes 21.92 minutes 5.04 minutes 43.20 seconds 99.99% ("four nines") 52.60 minutes 13.15 minutes 4.38 minutes 1.01 minutes 8.64 seconds
  150. Availability The next question we must ask about our system is whether the downtime SLAs should be computed on a daily, weekly, monthly, etc… basis Availability % Downtime per year Downtime per quarter Downtime per month Downtime per week Downtime per day (24 hours) 99.9% ("three nines") 8.77 hours 2.19 hours 43.83 minutes 10.08 minutes 1.44 minutes For 3 nines’ uptime computed daily, no more than 1.44 minutes of downtime is allowed!
  151. Availability Can we enforce availability? No! Availability is intrinsic to a system - it is impacted by calamities, many of which are out of our control! The best we can do is to measure & monitor it!
  152. Availability We now have a system that reliably delivers messages from source S to destination D at low latency below its scale rating. We also have a way to measure its availability! Can we enforce availability? No! Availability is intrinsic to a system - it is impacted by calamities, many of which are out of our control! The best we can do is to measure & monitor it!
  153. • We have a system with the NFRs that we desire! Availability
  154. Practical Challenges (What are some practical challenges that we face?)
  155. Practical Challenges Aside from building a lossless system, the next biggest challenge is keeping latency low! For example, what factors influence latency in the real world?
  156. Practical Challenges Minimizing latency is a practice in noise reduction!
  157. Practical Challenges Minimizing latency is a practice in noise reduction! The following factors impact latency in a streaming system! Network lookups needed for processing Autoscaling - Kafka Rebalancing AWS Maintenance Tasks (e.g. Kafka) Bad Releases
  158. Practical Challenges - Latency Consider a modern pipeline as shown below Ingest Normalize Enrich Route Transform Transmit A very common practice is to forbid any network lookups along the stream processing chain above! However, most of the services above need to look up some configuration data as part of processing E.g. Enrichment, Router, Transformer, Transmitter
  159. Practical Challenges - Latency A common approach would be to pull data from a central cache such as Redis Ingest Normalize Enrich Route Transform Transmit If lookups can be achieved in 1-2 ms, this lookup penalty may be acceptable. Get
  160. Practical Challenges - Latency In practice, we observe 1 minute outages whenever failovers occur in Redis HA Clusters! Ingest Normalize Enrich Route Transform Transmit To shield ourselves against any latency gaps, we adopt some caching practices! Get Failover
  161. Practical Challenges - Latency Practice 1 : Adopt Caffeine (Local Cache) across all services • Source : https://github.com/ben-manes/caffeine • Caffeine is a high performance, near optimal caching library. LoadingCache<Object, Object> cache = Caffeine.newBuilder() .refreshAfterWrite(<Staleness Interval>, TimeUnit) .expireAfterWrite(<Expiration Interval>, TimeUnit) .maximumSize(maxNumOfEntries) .build(key -> loadValue(key)); // refreshAfterWrite requires a LoadingCache built with a loader
  162-163. Practical Challenges - Latency Load a cache entry (v1), with refreshAfterWrite = 2 mins and expireAfterWrite = ∞ on the timeline:
LoadingCache<Object, Object> cache = Caffeine.newBuilder()
    .refreshAfterWrite(2, TimeUnit.MINUTES)
    .expireAfterWrite(30, TimeUnit.DAYS)
    .maximumSize(maxNumOfEntries)
    .build(key -> loadEntry(key)); // loadEntry is illustrative
  164-168. Practical Challenges - Latency With refreshAfterWrite = 2 mins and expireAfterWrite = ∞, a lookup after the staleness interval still returns the cached v1 immediately while the entry is reloaded asynchronously; subsequent lookups return the refreshed v2. Callers never block on a remote fetch, and staleness is bounded by the refresh interval.
  169. Practical Challenges - Latency Practice 2: Eager load all cache entries for all caches into Caffeine at application start. The microservice application (pod) is not marked ready until its caches load. Hence, until caches load, no traffic passes through the initializing application pod! With this in place, we avoid latency spikes due to Redis issues & lazy-init penalties!
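Practice 2 can be sketched with plain JDK types. This is a minimal illustration of the eager-load-then-ready pattern, not the actual PayPal code: names like EagerCacheBootstrap and loadAllFromStore() are hypothetical, and the real services load from Redis into Caffeine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: populate every cache entry at startup, and only then flip the
// readiness flag that a K8s readiness probe would check, so no traffic
// reaches the pod before its caches are warm.
public class EagerCacheBootstrap {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Stand-in for a bulk read from Redis or a config store.
    Map<String, String> loadAllFromStore() {
        return Map.of("route.default", "topic-a", "enrich.region", "eu-west-1");
    }

    public void start() {
        cache.putAll(loadAllFromStore()); // eager load everything up front
        ready.set(true);                  // only now report ready to K8s
    }

    public boolean isReady() { return ready.get(); }

    public String lookup(String key) { return cache.get(key); }
}
```

In Kubernetes, isReady() would back the readiness probe endpoint, which is what keeps traffic away from a pod that is still loading.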
  170. Practical Challenges Minimizing latency is a practice in noise reduction! The following factors impact latency in a streaming system! • Network lookups needed for processing • AWS Maintenance Tasks (e.g. Kafka) • Internal Operational Processes • Autoscaling - Kafka Rebalancing
  171. Practical Challenges - Latency Anytime AWS patched our Kafka clusters, we noticed large latency spikes across the pipeline (Ingest → Normalize → Enrich → Route → Transform → Transmit). Since AWS patches brokers in a rolling fashion, we typically see about 3 latency spikes over 30 minutes. These violated our availability SLAs, since we cannot afford more than 1 latency spike out of Lag SLA. Recall: 3 nines honored daily -> 1.44 minutes out of Lag SLA.
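The 1.44-minute figure is just the daily error budget implied by three nines of availability; a quick sketch of the arithmetic:

```java
// Daily error budget: the fraction of time a service may be out of SLA,
// applied to the 1440 minutes in a day. 99.9% -> 0.1% of 1440 ≈ 1.44 min.
public class ErrorBudget {
    public static double dailyBudgetMinutes(double availability) {
        return (1.0 - availability) * 24 * 60;
    }

    public static void main(String[] args) {
        System.out.println(dailyBudgetMinutes(0.999)); // ~1.44 minutes/day
    }
}
```

With roughly 3 spikes of about a minute each per patch window, a single rolling MSK patch can burn more than a full day's budget, which is why these spikes alone break the SLA.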
  172. MSK Patching Impacts Lag (graph: typical E2E lag, with lag spikes due to MSK maintenance)
  173. Practical Challenges - Latency To solve this issue, we adopted a Blue-Green operating model! Between weekly releases or scheduled maintenance, one color side (Light) receives customer traffic, while the other color side (Dark) receives none and has patches applied. Each side runs the full pipeline: Ingest → Normalize → Enrich → Route → Transform → Transmit.
  174. Practical Challenges - Latency Cost Efficiency: the Dark side is scaled to 0 to save on costs!
  175-176. Practical Challenges - Latency By switching traffic away from the side scheduled for AWS maintenance, we avoid a large cause of latency spikes. With this success, we immediately began leveraging the Blue-Green model for other operations as well, such as: • Software Releases • Canary Testing • Load Testing • Scheduled Maintenance • Technology Migrations • Outage Recovery • Doubling Capacity due to Sustained Traffic Burst
  177. Practical Challenges Minimizing latency is a practice in noise reduction! The following factors impact latency in a streaming system! • Network lookups needed for processing • AWS Maintenance Tasks (e.g. Kafka) • Internal Operational Processes • Autoscaling - Kafka Rebalancing
  178. Practical Challenges - Latency During autoscaling, we noticed latency spikes! These lasted less than 1 minute, but compromised our availability SLAs. This has to do with Kafka consumer rebalancing! Kafka consumers subscribe to topics. Topics are split into physical partitions, which are load-balanced across multiple consuming K8s pods in the same Consumer Group. During autoscaling activity, the number of consumers in a Consumer Group changes, causing Kafka to execute a partition reassignment (a.k.a. a rebalance).
  179. Practical Challenges - Latency The current default rebalancing behavior (via the range assignor) is a stop-the-world action. This causes pretty large pauses in consumption, which result in latency spikes! An alternative rebalancing algorithm, KICR (Kafka Incremental Cooperative Rebalancing), is available via the cooperative sticky assignor.
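Opting into KICR is a consumer-side configuration change. A minimal sketch of the relevant consumer properties follows; the bootstrap servers and group id are placeholders, while the `partition.assignment.strategy` key and `CooperativeStickyAssignor` class name are standard Kafka client configuration.

```java
import java.util.Properties;

// Consumer config sketch: swap the default (range) assignor for the
// CooperativeStickyAssignor to get incremental cooperative rebalancing,
// so a rebalance revokes only the partitions that actually move.
public class RebalanceConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder
        props.put("group.id", "enricher");               // placeholder
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }
}
```

These properties would be passed to `new KafkaConsumer<>(props)`; all members of the consumer group must be migrated to the cooperative assignor for the protocol to take effect.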
  180. Practical Challenges - Latency The graph below shows the impact of using KICR vs. default rebalancing.
  181. Practical Challenges - Latency Unfortunately, in using the Cooperative Sticky Assignor, we noticed significant duplicate consumption during autoscaling. In one load test, duplicate consumption at one service exceeded 100%. (https://issues.apache.org/jira/browse/KAFKA-14757)
  182. Conclusion • We now have a system with the NFRs that we desire! • While we've covered many key elements, we had to skip some for time (e.g. Isolation, Containerization, K8s Autoscaling). • These will be covered in upcoming blogs! Follow (@r39132) for updates. • We have also shown a real-life system and the challenges that go with it!
  183. Acknowledgments PayPal Datazoom • Vincent Chen • Anisha Nainani • Pramod Garre • Harsh Bhimani • Nirmalya Ghosh • Yash Shah • Deepak Mohankumar Chandramouli • Dheeraj Rampally • Romit Mehta • Kafka team (Maulin Vasavada, Na Yang, Doron Mimon) • Sandhu Santhakumar • Akhil S. Varughese • Shiju V. • Sunny Dabas • Shahanas S. • Naveen Chandra • Thasni M. Salim • Serena Eldhose • Gordon Lai • Fajib Aboobacker • Josh Evans • Bob Carlson • Tony Gentille