Discover Kafka on OpenShift:
Processing Real-Time Financial Events at Scale
The opinions expressed in this presentation are those of the presenter, in their individual capacity, and not necessarily those of Discover.
Anvesh Samineni
Senior Software Engineer
Ehfaj Khan
Principal Software Engineer
Agenda
• Provisioning Kafka Infrastructure on OpenShift for High Availability and Performance
• Kafka Deployment and Rollback Strategy in OpenShift
• Multi-Cluster Replication Design and Failover Strategy
Provisioning Kafka Infrastructure on OpenShift for High Availability and Performance
Page Cache
Larger node, many pods:
• Page Cache is shared
• Performance might vary
• Pods are dynamically provisioned
Smaller node, only Kafka pod:
• Page Cache is dedicated
• Less performance variation
• Only the Kafka pod is provisioned
Assigning Pods to Nodes (1/4)
Node Affinity
Kafka pods should go to Kafka nodes only
Assigning Pods to Nodes (2/4)
Pod Anti-Affinity
How Kafka pods should be placed relative to one another
(Required: Do not schedule if a Kafka pod already exists on the node.)
Assigning Pods to Nodes (3/4)
Pod Anti-Affinity
How Kafka pods should be placed relative to one another
(Preferred: Try to schedule Kafka pods across AZs.)
Assigning Pods to Nodes (4/4)
Taints and Tolerations
Other pods should NOT go to Kafka nodes
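The four placement rules above can be combined in one pod spec. The following is a minimal sketch, not the presenters' actual configuration: it builds the scheduling section of a Kafka pod spec as a Python dict using standard Kubernetes field names. The label and taint key/value (`dedicated=kafka`) and the pod label (`app: kafka`) are hypothetical; the slides do not name them. The toleration only lets Kafka pods onto tainted nodes; the matching `NoSchedule` taint is assumed to be applied to the Kafka nodes separately.

```python
from typing import Any, Dict

# Hypothetical names; the slides do not specify labels or taints.
NODE_LABEL_KEY = "dedicated"
NODE_LABEL_VALUE = "kafka"
POD_LABELS = {"app": "kafka"}


def kafka_scheduling_spec() -> Dict[str, Any]:
    """Return the scheduling-related part of a Kafka pod spec."""
    pod_selector = {"matchLabels": POD_LABELS}
    return {
        "affinity": {
            # Node affinity: Kafka pods go to Kafka nodes only.
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": NODE_LABEL_KEY,
                            "operator": "In",
                            "values": [NODE_LABEL_VALUE],
                        }]
                    }]
                }
            },
            "podAntiAffinity": {
                # Required: never two Kafka pods on the same node.
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": pod_selector,
                    "topologyKey": "kubernetes.io/hostname",
                }],
                # Preferred: spread Kafka pods across availability zones.
                "preferredDuringSchedulingIgnoredDuringExecution": [{
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": pod_selector,
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }],
            },
        },
        # Toleration: allows Kafka pods onto nodes tainted for Kafka.
        "tolerations": [{
            "key": NODE_LABEL_KEY,
            "operator": "Equal",
            "value": NODE_LABEL_VALUE,
            "effect": "NoSchedule",
        }],
    }
```

The returned dict could be merged into the `spec.template.spec` of a StatefulSet manifest before it is serialized.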
Handling Disruptions
Disruptions:
• Draining nodes accidentally
• Deleting many pods at a time
PodDisruptionBudget
Limit the number of concurrent disruptions
(Example: minAvailable = 2, since 2 concurrent disruptions would leave only 1 available Kafka pod)
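The arithmetic behind the minAvailable example can be sketched as follows. This is an illustration only: the 3-pod cluster size is an assumption inferred from the slide (2 disruptions leaving 1 pod), and the function name is hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary disruptions a PodDisruptionBudget still permits.

    With minAvailable set, an eviction is allowed only while the number
    of healthy pods stays at or above min_available afterwards.
    """
    return max(0, healthy_pods - min_available)


# Assumed 3-broker cluster with minAvailable = 2.
all_up = allowed_disruptions(healthy_pods=3, min_available=2)
one_down = allowed_disruptions(healthy_pods=2, min_available=2)
```

Here `all_up` is 1 and `one_down` is 0: once one Kafka pod is disrupted, further evictions are blocked until it is healthy again.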
Summary
Dedicated nodes for Kafka pods
Node Affinity
Kafka pods should go to Kafka nodes only
Pod Anti-Affinity
How Kafka pods should be placed relative to one another
• Required: Do not schedule if a Kafka pod already exists on the node.
• Preferred: Try to schedule Kafka pods across AZs.
Taints and Tolerations
Other pods should NOT go to Kafka nodes
PodDisruptionBudget
Limit the number of concurrent disruptions
Kafka Deployment and Rollback Strategy in OpenShift
Deployment Strategy
Deployment Strategy: OnDelete
Identify the Active Controller Pod
(Upgrade last)
Repeat for every pod except the Active Controller
(Starting from the last one)
1. Delete the pod
2. Wait until URP=0 (no under-replicated partitions)
3. Once URP=0, delete the next pod
Then upgrade the Active Controller:
1. Delete the Active Controller Pod
2. Wait until URP=0
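The delete order described on this slide can be sketched in Python. This is a hedged illustration, not the presenters' tooling: `delete_pod`, `urp_count`, and `wait` are hypothetical callbacks that a real script would back with the cluster API and a Kafka metrics source, and the pod list is assumed to be given in ordinal order.

```python
from typing import Callable, List


def upgrade_order(pods: List[str], active_controller: str) -> List[str]:
    """Last pod first, skipping the active controller, which goes last."""
    if active_controller not in pods:
        raise ValueError("active controller is not in the pod list")
    others = [p for p in reversed(pods) if p != active_controller]
    return others + [active_controller]


def rolling_upgrade(
    pods: List[str],
    active_controller: str,
    delete_pod: Callable[[str], None],  # hypothetical: deletes one pod
    urp_count: Callable[[], int],       # hypothetical: current URP value
    wait: Callable[[], None],           # hypothetical: sleep between polls
) -> List[str]:
    """Delete pods one at a time, waiting for URP=0 after each delete."""
    upgraded: List[str] = []
    for pod in upgrade_order(pods, active_controller):
        delete_pod(pod)
        while urp_count() > 0:
            wait()
        upgraded.append(pod)
    return upgraded
```

Passing the callbacks in keeps the ordering logic testable without a cluster.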
Rollback Strategy
Deployment Strategy: OnDelete
Identify the Active Controller Pod
(Upgrade last)
Repeat below for every pod except the Active Controller:
(Starting from the last one)
1. Delete the pod
2. Wait until URP=0
3. Once URP=0, delete the next pod
If one of the pods fails to restart:
Revert the StatefulSet to the previous version
Repeat below for all upgraded pods in the reverse order of upgrade:
(Start with the pod that failed to restart)
1. Delete the pod
2. Wait until URP=0
3. Once URP=0, delete the previous pod
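The rollback order follows directly from the upgrade order. A minimal sketch, assuming the StatefulSet has already been reverted to the previous version and that the caller has recorded which pods were deleted during the upgrade (the function name is hypothetical):

```python
from typing import List


def rollback_order(upgraded_in_order: List[str]) -> List[str]:
    """Return the pods to delete during rollback.

    upgraded_in_order lists the pods deleted during the upgrade, in the
    order they were deleted; the last entry is the pod that failed to
    restart. Rollback walks that list backwards, so the failed pod is
    deleted first and the first upgraded pod last.
    """
    return list(reversed(upgraded_in_order))
```

For example, if `kafka-2` was upgraded successfully and `kafka-1` then failed to restart, the rollback deletes `kafka-1` first and `kafka-2` second, waiting for URP=0 between deletes as in the upgrade.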
Multi-Cluster Replication Design and Failover Strategy
Replicator replicates:
• Topics
• Messages
• Consumer groups
During Failover:
• Flip to the bootstrap URL of secondary cluster
• Stop the Replicator
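On the client side, flipping the bootstrap URL can be as small as choosing between two configured addresses. The sketch below is only an illustration; the class, the function, and the idea of a boolean failover flag are assumptions, and the slides do not say how the flip is actually performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEndpoints:
    """Bootstrap addresses of the two clusters (values are placeholders)."""
    primary: str
    secondary: str


def bootstrap_servers(endpoints: ClusterEndpoints, failed_over: bool) -> str:
    """Pick the bootstrap address clients should connect to."""
    return endpoints.secondary if failed_over else endpoints.primary
```

The selected value would be passed as the `bootstrap.servers` setting of producers and consumers.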
Important to Enable Failover
Monitor Replicator
• Connectors are running
• Provision sufficient tasks
• No replication lag
Centralized Schema Registry
Enable Timestamp Interceptors
• Allows subscription to continue in the secondary cluster where it left off in the primary cluster
• Consumer groups in the secondary cluster are created by the Replicator
Provision ACLs for producers and consumers in the secondary cluster
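The first two Replicator checks (connectors running, enough tasks) can be sketched against the JSON returned by the Kafka Connect REST endpoint `GET /connectors/{name}/status`, assuming the Replicator runs as a Kafka Connect connector. The function name and the `min_tasks` threshold are hypothetical, and replication lag is not covered here because it has to come from a separate metrics source.

```python
from typing import Any, Dict, List


def replicator_problems(status: Dict[str, Any], min_tasks: int = 1) -> List[str]:
    """Inspect a Kafka Connect status document and list any problems.

    status is the parsed JSON body of GET /connectors/{name}/status.
    An empty result means the connector and all its tasks are RUNNING
    and at least min_tasks tasks exist.
    """
    problems: List[str] = []
    if status.get("connector", {}).get("state") != "RUNNING":
        problems.append("connector is not RUNNING")
    tasks = status.get("tasks", [])
    if len(tasks) < min_tasks:
        problems.append(f"only {len(tasks)} task(s), expected at least {min_tasks}")
    for task in tasks:
        if task.get("state") != "RUNNING":
            problems.append(f"task {task.get('id')} is {task.get('state')}")
    return problems
```

A monitoring job could poll the status endpoint, pass the parsed body to this function, and alert when the returned list is not empty.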
Thank You

Discover Kafka on OpenShift: Processing Real-Time Financial Events at Scale (Anvesh Samineni, Discover Financial) Kafka Summit 2020
