What’s up With Availability in Kafka? With Justine Olshan | Current 2022
How do we define and measure availability in a distributed system? A great thing about distributed systems is that they are built to tolerate failures in a way that limits downtime to users. However, this means that availability is a bit more complicated than "the system is up" or "the system is down."
Even if the system is built to tolerate failures, we may see individual components lose availability due to:
* cloud provider outages
* high latencies
* load balancer and/or routing issues
* storage failures
* hardware issues
Using Apache Kafka and Confluent Cloud as a case study, we will dig deeper into how to define good SLOs and SLAs for distributed systems. From there we will discuss ways to improve availability and the changes we made to Confluent Cloud to improve on Kafka's availability story.
2. Imagine this scenario….
I wasn’t able to talk to Apache Kafka® for 30 minutes!!
What do you mean? The servers were all up and running.
Well I know that my application was down! So something was wrong!
3. How do we define expectations?
● Service Level Indicator (SLI)
○ A measurement on a service
● Service Level Objective (SLO)
○ A goal for how we want our service to behave
● Service Level Agreement (SLA)
○ An agreement about expectations for the service
SRE fundamentals 2021: SLIs vs SLAs vs SLOs
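These definitions become concrete once an SLO is turned into an error budget. A minimal sketch (the 99.95% SLO and 30-day window are illustrative figures, not Confluent's actual SLA):

```python
# Sketch: turning an availability SLO into a monthly error budget.
# The 99.95% figure below is illustrative, not a real SLA value.

def error_budget_minutes(slo_percent: float, window_minutes: float) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return window_minutes * (1 - slo_percent / 100)

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200

budget = error_budget_minutes(99.95, MINUTES_PER_30_DAYS)
print(f"{budget:.1f} minutes of downtime allowed per 30 days")  # 21.6
```

The SLI is what you measure, the SLO is the target you compare it against, and the SLA is what you promise externally; the error budget is simply the SLO's slack expressed as time.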
6. Comparing Shutdowns…
Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper
5 Common Pitfalls When Using Apache Kafka
7. Gaps in Kafka’s Availability Story
● External network connectivity issues
○ Load balancers failing
○ Cloud provider outage
● Storage stuck on leader
● Intermittent issues
● High latency
8. What is up with Kafka?
● Metrics that truly measure availability
○ Can users interact with their data?
● Can we produce/consume? With a cluster or an individual partition?
● Can connections be made?
● Can we replicate?
● From there: define SLI, SLO, SLA
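The questions above suggest a probe-based SLI: synthetic clients repeatedly attempt a produce/consume round trip, and availability is the fraction that succeed. A minimal sketch (the probe mechanism itself is assumed, not shown):

```python
# Sketch: an availability SLI computed from synthetic probe results.
# Each entry records whether one produce/consume round trip succeeded.

def availability_sli(probe_results: list[bool]) -> float:
    """Fraction of successful probes; 1.0 means fully available."""
    if not probe_results:
        raise ValueError("no probes recorded")
    return sum(probe_results) / len(probe_results)

# 98 successes and 2 failures over the measurement window -> 0.98
probes = [True] * 98 + [False] * 2
print(f"SLI: {availability_sli(probes):.2%}")  # SLI: 98.00%
```

The same calculation can be scoped per cluster or per partition, which is exactly the distinction the bullet points draw.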
9. How can we mitigate unavailability?
● Detect misbehaving brokers and take action!
○ Transfer leadership – bin/kafka-reassign-partitions.sh
○ Restart or replace
[Diagram: partition leadership moves from the misbehaving broker to a healthy follower, which becomes the new leader]
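The leadership transfer above uses Kafka's partition reassignment tool, which consumes a JSON file describing the desired replica placement. A minimal example (topic name and broker IDs are illustrative):

```json
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [2, 3, 4] }
  ]
}
```

This file is applied with `bin/kafka-reassign-partitions.sh --bootstrap-server <broker> --reassignment-json-file reassign.json --execute`. Listing the preferred new leader first in `replicas` lets a subsequent preferred-leader election move leadership away from the misbehaving broker.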
10. Confluent has cool tools in cloud!
● Broker Leadership Priority APIs
● Automatic External Network Mitigation
● Automatic Stuck Storage Mitigation
Note: Confluent Cloud is the only place to take advantage of all these availability features!
12. Automatic External Network Mitigation
Symptoms:
● External (user) connections and traffic lost
● Internal (replication, ZooKeeper) connections and traffic remain
Mitigation:
● Monitor external traffic and use explicit pings to detect connectivity loss
● Automatically demote when external traffic lost
● Automatically promote when external traffic returns
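The demote/promote logic above can be sketched as a small state machine (illustrative only, not Confluent's implementation): a broker loses leadership eligibility after N consecutive failed external health checks and regains it after N consecutive successes, so a single flaky probe does not trigger churn.

```python
# Sketch of demote-on-loss / promote-on-recovery for external
# connectivity. Thresholds and probe mechanics are assumptions.

class ExternalConnectivityMonitor:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0        # length of the current run of same results
        self.last_ok = True
        self.demoted = False

    def record_check(self, ok: bool) -> bool:
        """Record one probe result; return True if the broker may hold leadership."""
        if ok == self.last_ok:
            self.streak += 1
        else:
            self.last_ok = ok
            self.streak = 1
        if not ok and self.streak >= self.threshold:
            self.demoted = True    # external traffic lost: demote
        if ok and self.streak >= self.threshold:
            self.demoted = False   # external traffic returned: promote
        return not self.demoted

monitor = ExternalConnectivityMonitor(threshold=3)
for result in [True, False, False, False, True, True, True]:
    eligible = monitor.record_check(result)
print("eligible for leadership:", eligible)  # True after recovery
```

Requiring a streak in both directions is the key design choice: it trades a little detection latency for stability, avoiding leadership flapping on intermittent issues.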
13. Automatic Stuck Storage Mitigation
Symptoms:
● Storage threads on a leader get stuck, leader can’t replicate
● Followers fall out of ISR
● Leader crashes resulting in offline partitions
Mitigation:
● Detect when threads get stuck
● Automatically restart the broker, leaders move
● Leadership won’t return unless the broker comes up healthy
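The detection step can be sketched as a heartbeat watchdog (illustrative, not Kafka's actual mechanism): each storage thread records a timestamp whenever it makes progress, and a watchdog flags any thread whose last heartbeat is older than a deadline.

```python
# Sketch: stuck-thread detection via progress heartbeats.
# Thread names and the 30-second deadline are assumptions.

def find_stuck_threads(heartbeats: dict[str, float],
                       now: float, deadline_s: float) -> list[str]:
    """Return names of threads with no progress within deadline_s seconds."""
    return [name for name, last_progress in heartbeats.items()
            if now - last_progress > deadline_s]

now = 1_000.0
heartbeats = {"log-flusher": now - 2.0, "log-cleaner": now - 45.0}
stuck = find_stuck_threads(heartbeats, now, deadline_s=30.0)
print("stuck threads:", stuck)  # ['log-cleaner']
```

Once a stuck thread is flagged, the mitigation above takes over: restart the broker so leadership moves, and withhold leadership until the broker comes back healthy.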
14. Reimagine this scenario….
Our monitoring noticed external connectivity loss to part of Kafka. We limited the unavailability by moving your data to an available part of the system. Hopefully this caused minimal downtime for your clients.
Got it. Thanks for keeping my cluster available and meeting the SLA!
15. Get started with Confluent Cloud to take advantage of the availability features mentioned today!
https://developer.confluent.io