Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flinkn Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params helping us to spot quickly an application lock or crash, the ones that can significantly improve the performance and the ones to touch with gloves since they could cause more harm than benefit. Moreover we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates.. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state offers you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flinkn Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as and some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params helping us to spot quickly an application lock or crash, the ones that can significantly improve the performance and the ones to touch with gloves since they could cause more harm than benefit. Moreover we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka Sink. After browsing the available options we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates.. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state offers you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples."
by
Thomas Weise
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...HostedbyConfluent
Apache Kafka is one of the most commonly used connectors with Apache Flink for exactly-once streaming use cases. The combination of both systems allows you to build mission-critical systems that require low end-to-end latency and exactly-once processing eg. banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports streaming and batch executions.
However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe the two-phase commit protocol implemented in KafkaSink in-depth and emphasize the difficulties we have overcome when applying Kafka’s transaction API to longer-lasting transactions.
We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works.
In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
Why do queries run out of memory? How can I make my queries even faster? How should I size ClickHouse nodes for best cost-efficiency? The key to these questions and many others is knowing what happens inside ClickHouse when a query runs. This webinar is a gentle introduction to ClickHouse internals, focusing on topics that will help your applications run faster and more efficiently. We’ll discuss the basic flow of query execution, dig into how ClickHouse handles aggregation and joins, and show you how ClickHouse distributes processing within a single CPU as well as across many nodes in the network. After attending this webinar you’ll understand how to open up the black box and see what the parts are doing.
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Improving Kafka at-least-once performance at UberYing Zheng
At Uber, we are seeing an increasing demand for Kafka at-least-once delivery (asks=all). So far, we are running a dedicated at-least-once Kafka cluster with special settings. With a very low workload, the dedicated at-least-once cluster has been working well for more than a year. When trying to allow at-least-once producing on the regular Kafka clusters, the producing performance was the main concern. We spent some effort on this issue in the recent months, and managed to reduce at-least-once producer latency by about 80% with code changes and configuration tuning. When acks=0, these improvements also help increasing Kafka throughput and reducing Kafka end-to-end latency.
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Battle of the Stream Processing Titans – Flink versus RisingWaveYingjun Wu
The world of real-time data processing is constantly evolving, with new technologies and platforms emerging to meet the ever-increasing demands of modern data-driven businesses. Apache Flink and RisingWave are two powerful stream processing solutions that have gained significant traction in recent years. But which platform is right for your organization? Karin Wolok and Yingjun Wu go head-to-head to compare and contrast the strengths and limitations of Flink and RisingWave. They’ll also share real-world use cases, best practices for optimizing performance and efficiency, and key considerations for selecting the right solution for your specific business needs.
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
State is an essential part of the modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, enrichment, etc. But usually, the state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much. But what if we treat the state differently? The keyed state in Flink can be scaled vertically and horizontally, it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL?
At Shopify, we've worked on a massive analytical data pipeline that's needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with an idea to never clear state and support joins this way. We've made a successful proof of concept, ingested all historical transactional Shopify data and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...HostedbyConfluent
More and more Enterprises are relying on Apache Kafka to run their businesses. Cluster administrators need the ability to mirror data between clusters to provide high availability and disaster recovery.
MirrorMaker 2, released recently as part of Kafka 2.4.0, allows you to mirror multiple clusters and create many replication topologies. Learn all about this awesome new tool and how to reliably and easily mirror clusters.
We will first describe how MirrorMaker 2 works, including how it addresses all the shortcomings of MirrorMaker 1. We will also cover how to decide between its many deployment modes. Finally, we will share our experience running it in production as well as our tips and tricks to get a smooth ride.
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufVerverica
The need to enrich a fast, high volume data stream with slow-changing reference data is probably one of the most wide-spread requirements in stream processing applications. Apache Flink's built-in join functionalities and its flexible lower-level APIs support stream enrichment in various ways depending on the specific requirements of the use case at hand. In this webinar, I like to provide an overview of the basic methods to enrich a data stream with Apache Flink and highlight use cases, limitations, advantages and disadvantages of each.
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...Flink Forward
http://flink-forward.org/kb_sessions/connecting-apache-flink-with-the-world-reviewing-the-streaming-connectors/
Getting data in and out of Flink in a reliable fashion is one of the most important tasks of a stream processor. This talk will review the most important and frequently used connectors in Flink. Apache Kafka and Amazon Kinesis Streams both fall into the same category of distributed, high-throughput and durable publish-subscribe messaging systems. The talk will explain how the connectors in Flink for these systems are implemented. In particular we’ll focus on how we ensure exactly-once semantics while consuming data and how offsets/sequence numbers are handled. We will also review two generic tools in Flink for connectors: A message acknowledging source for classical message queues (like those implementing AMQP) and a generic write ahead log sink, using Flink’s state backend abstraction. The objective of the talk is to explain the internals of the streaming connectors, so that people can understand their behavior, configure them properly and implement their own connectors.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples."
by
Thomas Weise
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...HostedbyConfluent
Apache Kafka is one of the most commonly used connectors with Apache Flink for exactly-once streaming use cases. The combination of both systems allows you to build mission-critical systems that require low end-to-end latency and exactly-once processing eg. banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports streaming and batch executions.
However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe the two-phase commit protocol implemented in KafkaSink in-depth and emphasize the difficulties we have overcome when applying Kafka’s transaction API to longer-lasting transactions.
We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works.
In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
Why do queries run out of memory? How can I make my queries even faster? How should I size ClickHouse nodes for best cost-efficiency? The key to these questions and many others is knowing what happens inside ClickHouse when a query runs. This webinar is a gentle introduction to ClickHouse internals, focusing on topics that will help your applications run faster and more efficiently. We’ll discuss the basic flow of query execution, dig into how ClickHouse handles aggregation and joins, and show you how ClickHouse distributes processing within a single CPU as well as across many nodes in the network. After attending this webinar you’ll understand how to open up the black box and see what the parts are doing.
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Andreson
Improving Kafka at-least-once performance at UberYing Zheng
At Uber, we are seeing an increasing demand for Kafka at-least-once delivery (asks=all). So far, we are running a dedicated at-least-once Kafka cluster with special settings. With a very low workload, the dedicated at-least-once cluster has been working well for more than a year. When trying to allow at-least-once producing on the regular Kafka clusters, the producing performance was the main concern. We spent some effort on this issue in the recent months, and managed to reduce at-least-once producer latency by about 80% with code changes and configuration tuning. When acks=0, these improvements also help increasing Kafka throughput and reducing Kafka end-to-end latency.
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Battle of the Stream Processing Titans – Flink versus RisingWaveYingjun Wu
The world of real-time data processing is constantly evolving, with new technologies and platforms emerging to meet the ever-increasing demands of modern data-driven businesses. Apache Flink and RisingWave are two powerful stream processing solutions that have gained significant traction in recent years. But which platform is right for your organization? Karin Wolok and Yingjun Wu go head-to-head to compare and contrast the strengths and limitations of Flink and RisingWave. They’ll also share real-world use cases, best practices for optimizing performance and efficiency, and key considerations for selecting the right solution for your specific business needs.
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
State is an essential part of the modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, enrichment, etc. But usually, the state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much. But what if we treat the state differently? The keyed state in Flink can be scaled vertically and horizontally, it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL?
At Shopify, we've worked on a massive analytical data pipeline that's needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with an idea to never clear state and support joins this way. We've made a successful proof of concept, ingested all historical transactional Shopify data and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...HostedbyConfluent
More and more Enterprises are relying on Apache Kafka to run their businesses. Cluster administrators need the ability to mirror data between clusters to provide high availability and disaster recovery.
MirrorMaker 2, released recently as part of Kafka 2.4.0, allows you to mirror multiple clusters and create many replication topologies. Learn all about this awesome new tool and how to reliably and easily mirror clusters.
We will first describe how MirrorMaker 2 works, including how it addresses all the shortcomings of MirrorMaker 1. We will also cover how to decide between its many deployment modes. Finally, we will share our experience running it in production as well as our tips and tricks to get a smooth ride.
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufVerverica
The need to enrich a fast, high volume data stream with slow-changing reference data is probably one of the most wide-spread requirements in stream processing applications. Apache Flink's built-in join functionalities and its flexible lower-level APIs support stream enrichment in various ways depending on the specific requirements of the use case at hand. In this webinar, I like to provide an overview of the basic methods to enrich a data stream with Apache Flink and highlight use cases, limitations, advantages and disadvantages of each.
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Robert Metzger - Connecting Apache Flink to the World - Reviewing the streami...Flink Forward
http://flink-forward.org/kb_sessions/connecting-apache-flink-with-the-world-reviewing-the-streaming-connectors/
Getting data in and out of Flink in a reliable fashion is one of the most important tasks of a stream processor. This talk will review the most important and frequently used connectors in Flink. Apache Kafka and Amazon Kinesis Streams both fall into the same category of distributed, high-throughput and durable publish-subscribe messaging systems. The talk will explain how the connectors in Flink for these systems are implemented. In particular we’ll focus on how we ensure exactly-once semantics while consuming data and how offsets/sequence numbers are handled. We will also review two generic tools in Flink for connectors: A message acknowledging source for classical message queues (like those implementing AMQP) and a generic write ahead log sink, using Flink’s state backend abstraction. The objective of the talk is to explain the internals of the streaming connectors, so that people can understand their behavior, configure them properly and implement their own connectors.
http://flink-forward.org/kb_sessions/scaling-stream-processing-with-apache-flink-to-very-large-state/
The majority of streaming programs is ‘stateful’: Windowed Aggregations, Sessions, Joins, Complex Event Processing, Tables – they all require to keep some form of state across individual events. With the migration of more and more complex batch jobs or data processing pipelines to streaming applications, some streaming programs need to keep terabytes of state. Apache Flink implements a checkpointing-based recovery mechanism that guarantees exactly-once semantics for state also in the presence of failures. The cost of checkpointing and recovery depends on the size of the program’s state. In this talk, we will discuss the current status of stateful processing in Apache Flink, as well as the ongoing efforts to make Flink’s fault tolerance mechanism scale to very large state sizes, supporting frequent checkpoints and faster recovery of large state, without requiring excessive numbers of machines.
Kube-proxy enables access to Kubernetes services (virtual IPs backed by pods) by configuring client-side load-balancing on nodes. The first implementation relied on a userspace proxy which was not very performant. The second implementation used iptables and is still the one used in most Kubernetes clusters. Recently, the community introduced an alternative based on IPVS. This talk will start with a description of the different modes and how they work. It will then focus on the IPVS implementation, the improvements it brings, the issues we encountered and how we fixed them as well as the remaining challenges and how they could be addressed. Finally, the talk will present alternative solutions based on eBPF such as Cilium.
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...HostedbyConfluent
This talk is aimed to give developers who are interested to scale their streaming application with Exactly-Once (EOS) guarantees. Since the original release, EOS processing has received wide adoption as a much needed feature inside the community, and has also exposed various scalability and usability issues when applied in production systems.
To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will have a deep-dive into the newly added features (KIP-447), from which the audience will have more insight into the scalability v.s. semantics guarantees tradeoffs and how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library. We would also present how the EOS code can be simplified with plain Producer and Consumer. Come to learn more if you wish to adopt this improved EOS feature and get started on building your own EOS application today!
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and...Guozhang Wang
Since the original release, EOS processing has received wide adoption as a much needed feature inside the community, and has also exposed various scalability and usability issues when applied in production systems.
To address those issues, we improved on the existing EOS model by integrating static Producer transaction semantics with dynamic Consumer group semantics. We will have a deep-dive into the newly added features (KIP-447), from which the audience will have more insight into the scalability v.s. semantics guarantees tradeoffs and how Kafka Streams specifically leveraged them to help scale EOS streaming applications written in this library.
5-LEC- 5.pptxTransport Layer. Transport Layer ProtocolsZahouAmel1
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transport Layer.
Transport Layer Protocols
Transpor
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...HostedbyConfluent
"Apache Flink’s Exactly-Once Semantics (EOS) integration for writing to Apache Kafka has several pitfalls, due mostly to the fact that the Kafka transaction protocol was not originally designed with distributed transactions in mind. The integration uses Java reflection hacks as a workaround, and the solution can still result in data loss under certain scenarios. Can we do better?
In this session, you’ll see how the Flink and Kafka communities are uniting to tackle these long-standing technical debts. We’ll introduce the basics of how Flink achieves EOS with external systems and explore the common hurdles that are encountered when implementing distributed transactions. Then we’ll dive into the details of the proposed changes to both the Kafka transaction protocol and Flink transaction coordination that seek to provide a more robust integration.
By the end of the talk, you’ll know the unique challenges of EOS with Flink and Kafka and the improvements you can expect across both projects."
More Data, More Problems: Scaling Kafka Mirroring Pipelines at LinkedInCelia Kung
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address these, we have developed a new mirroring solution, built on top our stream ingestion service, Brooklin. Brooklin’s mirroring solution aims to provide improved performance and stability, while facilitating better management via finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker.
In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin and our plans for iterating further on this new mirroring solution.
The slides are a revised version of the presentation made as a proposal for Kafka Summit. Here below the abstract sent:
Business continuity and load-balancing are key features in modern manufacturing and transportation environments. Manage real-time, scalable and distributed systems is a challenge for every software architect.
In the presentation, we demonstrate how Apache Kafka helped us to manage real-time load-balancing and business continuity across multiple datacenters.
Our platform is entirely based on message exchange between interconnected plugins: external I/O and internal plugins communicate using a message based interface. A similar logic can be found in Apache Kafka: giving to our platform an implementation of message flow based on Apache Kafka given us the possibility to distribute, across all datacenters involved, the messages exchanged internally between plugins.
The internal queue becomes a “Kafka multicast message queue” with multiple producer/consumer. Configuring which plugins interconnection shall use a “Kafka queue”, and which producer/consumer is active, it is possible to obtain both business continuity and load balancing.
Manufacturing and transportation environments communicate with sensors and actuators within the field: it is mandatory that every event is processed and no event is lost. In most cases sensors and actuators are redundant, but only one can be active at the same time. To reach this objective the entire platform has a global orchestrator in charge to arbitrate all plugins across all datacenters: using ZooKeeper distributed lock recipe the orchestrator is able to manage a global lock and, modifying producer/consumer active state, select continually which plugin is in charge to execute the work. Again the orchestrator has a global “objective function” which try to optimize the balancing of load between datacenters, ensuring meantime the business continuity in case of fault condition or disaster.
Since our platform is based on OPC-UA technology: Apache Kafka helped us to implement transparent redundant behavior of OPC-UA servers using same technic.
dA Platform is a production-ready platform for stream processing with Apache Flink®. The Platform includes open source Apache Flink, a stateful stream processing and event-driven application framework, and dA Application Manager, a central deployment and management component. dA Platform schedules clusters on Kubernetes, deploys stateful Flink applications, and controls these applications and their state.
January 2016 Flink Community Update & Roadmap 2016Robert Metzger
This presentation from the 13th Flink Meetup in Berlin contains the regular community update for January and a walkthrough of the most important upcoming features in 2016
This presentation held in at Inovex GmbH in Munich in November 2015 was about a general introduction of the streaming space, an overview of Flink and use cases of production users as presented at Flink Forward.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
2. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 0
Zookeeper
offset partition 0: 0
offset partition 1: 0
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 0, 0
This toy example is reading from a Kafka topic with two partitions, each containing “a”, “b”, “c”, … as messages.
The offset is set to 0 for both partitions, a counter is initialized to 0.
3. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
a
counter = 0
Zookeeper
offset partition 0: 0
offset partition 1: 0
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 1, 0
The Kafka consumer starts reading messages from partition 0. Message “a” is in-flight, the offset for the first
consumer has been set to 1.
4. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
a
counter = 1
Zookeeper
offset partition 0: 0
offset partition 1: 0
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 2, 1
a
b
Trigger
Checkpoint at
source
Message “a” arrives at the counter, it is set to 1. The consumers both read the next records (“b” and “a”). The
offsets are set accordingly. In parallel, the checkpoint coordinator decides to trigger a checkpoint at the source …
5. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator a
counter = 2
Zookeeper
offset partition 0: 0
offset partition 1: 0
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 3, 1
a
b
offsets = 2, 1
c
The source has created a snapshot of its state (“offset=2,1”), which is now stored in the checkpoint coordinator.
The sources emitted a checkpoint barrier after messages “a” and “b”.
6. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 3
Zookeeper
offset partition 0: 0
offset partition 1: 0
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 3, 2
a
b
offsets = 2, 1 counter = 3
c
b
The map operator has received checkpoint barriers from both sources. It checkpoints its state (counter=3) in the
coordinator. At the same time, the consumers are further reading more data from the Kafka partitions.
7. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 4
Zookeeper
offset partition 0: 0
offset partition 1: 0
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 3, 2
a
offsets = 2, 1 counter = 3
c
b
Notify
checkpoint
complete
The checkpoint coordinator informs the Kafka consumer that the checkpoint has been completed. It commits the
checkpoints offsets into Zookeeper. Note that Flink is not relying on the Kafka offsets in ZK for restoring from failures
8. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 4
Zookeeper
offset partition 0: 2
offset partition 1: 1
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 3, 2
a
offsets = 2, 1 counter = 3
c
b
Checkpoint in
Zookeeper
The checkpoint is now persisted in Zookeeper. External tools such as the Kafka Offset Checker can see the lag of the
consumer group.
9. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 5
Zookeeper
offset partition 0: 2
offset partition 1: 1
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 4, 2
offsets = 2, 1 counter = 3
c
b
d
The processing further advances
10. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 5
Zookeeper
offset partition 0: 2
offset partition 1: 1
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 4, 2
offsets = 2, 1 counter = 3
c
b
d
Failure
Some failure has happened (such as worker failure)
11. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 3
Zookeeper
offset partition 0: 2
offset partition 1: 1
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 2, 1
offsets = 2, 1 counter = 3 Reset all
operators to
last completed
checkpoint
The checkpoint coordinator restores the state at all the operators participating at the checkpointing. The Kafka
sources start from offset 2 and 1, the counter’s value is 3.
12. a b c d e
a b c d e
Flink Kafka Consumer
Flink Kafka Consumer
Flink Map Operator
counter = 3
Zookeeper
offset partition 0: 2
offset partition 1: 1
Flink Checkpoint Coordinator
Pending:
Completed:
offsets = 3, 1
offsets = 2, 1 counter = 3
Continue
processing …
c
The system continues with the processing, the counter’s value is consistent across a worker failure.