Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high-throughput, and low-latency data source. However, scaling out data streams becomes challenging when migrations and multiple Kafka clusters are required. Thus, we introduced a new Kafka source that reads sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and show how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
13. Manual Migration Steps (User)
• Change the source uid
• Change the bootstrap servers
• Upgrade the application
• Restart with non-restored state
• Change parallelism and resources to catch up with the lag
• Revert to the steady-state configuration once caught up
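The first two steps above amount to a small configuration change; a sketch, assuming the job reads its Kafka settings from a properties file (the key names are hypothetical):

```diff
- source.uid=kafka-source-cluster-a
- kafka.bootstrap.servers=cluster-a-broker-1:9092
+ source.uid=kafka-source-cluster-b
+ kafka.bootstrap.servers=cluster-b-broker-1:9092
```

Because the uid changes, the new source operator cannot match the old operator's state, so the job must be restarted allowing non-restored state and begins reading fresh from the new cluster, which is why the catch-up step follows.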
14. Manual Migration Steps: Drawbacks
• Application downtime
• Increased system resources needed for catch-up
• Manual user toil
• A user could have 100+ jobs
• Multiple hours of team coordination
15. Scaling Multiple Kafka Clusters
• Hybrid cloud: on-prem, private cloud, and public cloud providers
• Scalability
• Topic sharding
• Operability and failover
• In-place upgrades are complex and error-prone
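Topic sharding here means a single logical stream is spread over physical topics on several clusters, so a reader must resolve the stream to every (cluster, topic) pair. A minimal sketch of such a lookup, where the stream names, cluster addresses, and metadata shape are all hypothetical (in practice this would come from a metadata service):

```java
import java.util.List;
import java.util.Map;

public class StreamMetadata {
    // Hypothetical metadata: logical stream id -> (cluster -> physical topics).
    static final Map<String, Map<String, List<String>>> STREAMS = Map.of(
            "user-events", Map.of(
                    "cluster-a:9092", List.of("user-events-shard-0", "user-events-shard-1"),
                    "cluster-b:9092", List.of("user-events-shard-2")));

    // Resolve a logical stream to all (cluster, topics) pairs a reader must consume.
    static Map<String, List<String>> resolve(String streamId) {
        return STREAMS.getOrDefault(streamId, Map.of());
    }

    public static void main(String[] args) {
        resolve("user-events").forEach((cluster, topics) ->
                System.out.println(cluster + " -> " + topics));
    }
}
```

With this shape, adding a shard or a cluster only changes metadata, not the consuming application's code.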
55. Multi Cluster Kafka Source Benefits
• Migrations and failover are automated transparently within the source
• Simplifies operations between compute and storage infra
• Compatible with Hybrid Source
• Can be leveraged for topic migration
56. Future Work
• Integrate with split-level watermark alignment
• Optimizations to remove only the affected readers
• FLIP-246 (https://cwiki.apache.org/confluence/display/FLINK/FLIP-246%3A+Multi+Cluster+Kafka+Source)
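One way to read the "affected readers" optimization: when cluster metadata changes, only readers for clusters whose topic assignment actually changed need to be restarted, while the rest keep running. A small set-difference sketch of that idea (the data shapes are hypothetical, not the FLIP-246 implementation):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class AffectedReaders {
    // Given old and new (cluster -> assigned topics) maps, return the clusters
    // whose assignment changed; unchanged clusters can keep their readers.
    static Set<String> affected(Map<String, Set<String>> oldAssign,
                                Map<String, Set<String>> newAssign) {
        Set<String> clusters = new HashSet<>(oldAssign.keySet());
        clusters.addAll(newAssign.keySet());
        Set<String> changed = new HashSet<>();
        for (String cluster : clusters) {
            if (!Objects.equals(oldAssign.get(cluster), newAssign.get(cluster))) {
                changed.add(cluster);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> before = Map.of(
                "cluster-a", Set.of("t0"), "cluster-b", Set.of("t1"));
        Map<String, Set<String>> after = Map.of(
                "cluster-a", Set.of("t0"), "cluster-c", Set.of("t1"));
        // cluster-b was removed and cluster-c added; cluster-a is untouched.
        System.out.println(affected(before, after));
    }
}
```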