Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
A brief introduction to Apache Kafka and describe its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that will help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was developed by LinkedIn in 2011 to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. This slide deck covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example to demonstrate failover of brokers as well as consumers. Then it goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded on the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advance Kafka Producers.
This document discusses Apache Kafka, an open-source distributed event streaming platform. It provides an introduction to Kafka's design and capabilities including:
1) Kafka is a distributed publish-subscribe messaging system that can handle high throughput workloads with low latency.
2) It is designed for real-time data pipelines and activity streaming and can be used for transporting logs, metrics collection, and building real-time applications.
3) Kafka supports distributed, scalable, fault-tolerant storage and processing of streaming data across multiple producers and consumers.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. This session explains the idea behind databases and different features like storage, queries, transactions, and processing to evaluate when Kafka is a good fit and when it is not.
The discussion includes different Kafka-native add-ons like Tiered Storage for long-term, cost-efficient storage and ksqlDB as event streaming database. The relation and trade-offs between Kafka and other databases are explored to complement each other instead of thinking about a replacement. This includes different options for pull and push-based bi-directional integration.
Key takeaways:
- Kafka can store data forever in a durable and high available manner
- Kafka has different options to query historical data
- Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more powerful than ever before to store and process data
- Kafka does not provide transactions, but exactly-once semantics
- Kafka is not a replacement for existing databases like MySQL, MongoDB or Elasticsearch
- Kafka and other databases complement each other; the right solution has to be selected for a problem
- Different options are available for bi-directional pull and push-based integration between Kafka and databases to complement each other
Video Recording:
https://youtu.be/7KEkWbwefqQ
Blog post:
https://www.kai-waehner.de/blog/2020/03/12/can-apache-kafka-replace-database-acid-storage-transactions-sql-nosql-data-lake/
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
A brief introduction to Apache Kafka and describe its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that will help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was developed by LinkedIn in 2011 to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. This slide deck covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example to demonstrate failover of brokers as well as consumers. Then it goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded on the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advance Kafka Producers.
This document discusses Apache Kafka, an open-source distributed event streaming platform. It provides an introduction to Kafka's design and capabilities including:
1) Kafka is a distributed publish-subscribe messaging system that can handle high throughput workloads with low latency.
2) It is designed for real-time data pipelines and activity streaming and can be used for transporting logs, metrics collection, and building real-time applications.
3) Kafka supports distributed, scalable, fault-tolerant storage and processing of streaming data across multiple producers and consumers.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. This session explains the idea behind databases and different features like storage, queries, transactions, and processing to evaluate when Kafka is a good fit and when it is not.
The discussion includes different Kafka-native add-ons like Tiered Storage for long-term, cost-efficient storage and ksqlDB as event streaming database. The relation and trade-offs between Kafka and other databases are explored to complement each other instead of thinking about a replacement. This includes different options for pull and push-based bi-directional integration.
Key takeaways:
- Kafka can store data forever in a durable and high available manner
- Kafka has different options to query historical data
- Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more powerful than ever before to store and process data
- Kafka does not provide transactions, but exactly-once semantics
- Kafka is not a replacement for existing databases like MySQL, MongoDB or Elasticsearch
- Kafka and other databases complement each other; the right solution has to be selected for a problem
- Different options are available for bi-directional pull and push-based integration between Kafka and databases to complement each other
Video Recording:
https://youtu.be/7KEkWbwefqQ
Blog post:
https://www.kai-waehner.de/blog/2020/03/12/can-apache-kafka-replace-database-acid-storage-transactions-sql-nosql-data-lake/
Producer Performance Tuning for Apache KafkaJiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
Uber has one of the largest Kafka deployment in the industry. To improve the scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers/consumers. Users do not need to know which cluster a topic resides and the clients view a "logical cluster". The federation layer will map the clients to the actual physical clusters, and keep the location of the physical cluster transparent from the user. Cluster federation brings us several benefits to support our business growth and ease our daily operation. In particular, Client control. Inside Uber there are a large of applications and clients on Kafka, and it's challenging to migrate a topic with live consumers between clusters. Coordinations with the users are usually needed to shift their traffic to the migrated cluster. Cluster federation enables much control of the clients from the server side by enabling consumer traffic redirection to another physical cluster without restarting the application. Scalability: With federation, the Kafka service can horizontally scale by adding more clusters when a cluster is full. The topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, from the user perspective, they view only one logical cluster. Availability: With a topic replicated to at least two clusters we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region-failover. This also provides much freedom and alleviates the risks for us to carry out important maintenance on a critical cluster. Before the maintenance, we mark the cluster as a secondary and migrate off the live traffic and consumers. We will present the details of the architecture and several interesting technical challenges we overcame.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Kai Wähner
Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments
Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. This session gives an overview of several scenarios that may require multi-cluster solutions and discusses real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka.
Key takeaways:
In many scenarios, one Kafka cluster is not enough. Understand different architectures and alternatives for multi-cluster deployments.
Zero data loss and high availability are two key requirements. Understand how to realize this, including trade-offs.
Learn about features and limitations of Kafka for multi cluster deployments
Global Kafka and mission-critical multi-cluster deployments with zero data loss and high availability became the normal, not an exception.
This document provides an introduction to Apache Kafka, an open-source distributed event streaming platform. It discusses Kafka's history as a project originally developed by LinkedIn, its use cases like messaging, activity tracking and stream processing. It describes key Kafka concepts like topics, partitions, offsets, replicas, brokers and producers/consumers. It also gives examples of how companies like Netflix, Uber and LinkedIn use Kafka in their applications and provides a comparison to Apache Spark.
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
This document discusses applying machine learning models to real-time stream processing using Apache Kafka. It covers building analytic models from historical data, applying those models to real-time streams without redevelopment, and techniques for online training of models. Live demos are presented using open source tools like Kafka Streams, Kafka Connect, and H2O to apply machine learning to streaming use cases like flight delay prediction. The key takeaway is that streaming platforms can leverage pre-built machine learning models to power real-time analytics and actions.
Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional.
In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause.
Visit www.confluent.io for more information.
Kafka is a distributed publish-subscribe messaging system that allows both streaming and storage of data feeds. It is designed to be fast, scalable, durable, and fault-tolerant. Kafka maintains feeds of messages called topics that can be published to by producers and subscribed to by consumers. A Kafka cluster typically runs on multiple servers called brokers that store topics which may be partitioned and replicated for fault tolerance. Producers publish messages to topics which are distributed to consumers through consumer groups that balance load.
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Apache Kafka is a distributed publish-subscribe messaging system that can handle high volumes of data and enable messages to be passed from one endpoint to another. It uses a distributed commit log that allows messages to be persisted on disk for durability. Kafka is fast, scalable, fault-tolerant, and guarantees zero data loss. It is used by companies like LinkedIn, Twitter, and Netflix to handle high volumes of real-time data and streaming workloads.
Haitao Zhang, Uber, Software Engineer + Yang Yang, Uber, Senior Software Engineer
Kafka Consumer Proxy is a forwarding proxy that consumes messages from Kafka and dispatches them to a user registered gRPC service endpoint. With Kafka Consumer Proxy, the experience of consuming messages from Apache Kafka for pub-sub use cases is as seamless and user-friendly as receiving (g)RPC requests. In this talk, we will share (1) the motivation for building this service, (2) the high-level architecture, (3) the mechanisms we designed to achieve high availability, scalability, and reliability, and (4) the current adoption status.
https://www.meetup.com/KafkaBayArea/events/273834934/
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages - producers and consumers of information.
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
Presented at the inaugural Kafka summit (2016) hosted by Confluent in San Francisco
Abstract:
Kafka is a backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. 2015 was an exciting year at LinkedIn in that we hit a new level of scale with Kafka: we now process more than 1 trillion published messages per day across nearly 1300 brokers. We run into some interesting production issues at this scale and I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year:
Data loss: We have extremely stringent SLAs on latency and completeness that were violated on a few occasions. Some of these incidents were due to subtle configuration problems or even missing features.
Offset resets: As of early 2015, Kafka-based offset management was still a relatively new feature and we occasionally hit offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management/log compaction as well as our monitoring.
Cluster unavailability due to high request/response latencies: Such incidents demonstrate how even subtle performance regressions and monitoring gaps can lead to an eventual cluster meltdown.
Power failures! What happens when an entire data center goes down? We experienced this first hand and it was not so pretty.
and more…
This talk will go over how we detected, investigated and remediated each of these issues and summarize some of the features in Kafka that we are working on that will help eliminate or mitigate such incidents in the future.
The art of the event streaming application: streams, stream processors and sc...confluent
The document discusses event streaming applications and microservices. It introduces event streaming as an architectural style where applications are composed of loosely coupled services that communicate asynchronously through streams of events. Key aspects covered include handling state using event streams and Kafka Streams, building applications as bounded contexts with choreography and orchestration, and establishing pillars for instrumentation, control and operations. Overall the document promotes event streaming as a paradigm that addresses complexity by providing simplicity and scalability through convergent data and logic processing.
Producer Performance Tuning for Apache KafkaJiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
Uber has one of the largest Kafka deployment in the industry. To improve the scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers/consumers. Users do not need to know which cluster a topic resides and the clients view a "logical cluster". The federation layer will map the clients to the actual physical clusters, and keep the location of the physical cluster transparent from the user. Cluster federation brings us several benefits to support our business growth and ease our daily operation. In particular, Client control. Inside Uber there are a large of applications and clients on Kafka, and it's challenging to migrate a topic with live consumers between clusters. Coordinations with the users are usually needed to shift their traffic to the migrated cluster. Cluster federation enables much control of the clients from the server side by enabling consumer traffic redirection to another physical cluster without restarting the application. Scalability: With federation, the Kafka service can horizontally scale by adding more clusters when a cluster is full. The topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, from the user perspective, they view only one logical cluster. Availability: With a topic replicated to at least two clusters we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region-failover. This also provides much freedom and alleviates the risks for us to carry out important maintenance on a critical cluster. Before the maintenance, we mark the cluster as a secondary and migrate off the live traffic and consumers. We will present the details of the architecture and several interesting technical challenges we overcame.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Kai Wähner
Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments
Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. This session gives an overview of several scenarios that may require multi-cluster solutions and discusses real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka.
Key takeaways:
In many scenarios, one Kafka cluster is not enough. Understand different architectures and alternatives for multi-cluster deployments.
Zero data loss and high availability are two key requirements. Understand how to realize this, including trade-offs.
Learn about features and limitations of Kafka for multi cluster deployments
Global Kafka and mission-critical multi-cluster deployments with zero data loss and high availability became the normal, not an exception.
This document provides an introduction to Apache Kafka, an open-source distributed event streaming platform. It discusses Kafka's history as a project originally developed by LinkedIn, its use cases like messaging, activity tracking and stream processing. It describes key Kafka concepts like topics, partitions, offsets, replicas, brokers and producers/consumers. It also gives examples of how companies like Netflix, Uber and LinkedIn use Kafka in their applications and provides a comparison to Apache Spark.
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
This document discusses applying machine learning models to real-time stream processing using Apache Kafka. It covers building analytic models from historical data, applying those models to real-time streams without redevelopment, and techniques for online training of models. Live demos are presented using open source tools like Kafka Streams, Kafka Connect, and H2O to apply machine learning to streaming use cases like flight delay prediction. The key takeaway is that streaming platforms can leverage pre-built machine learning models to power real-time analytics and actions.
Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional.
In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause.
Visit www.confluent.io for more information.
Kafka is a distributed publish-subscribe messaging system that allows both streaming and storage of data feeds. It is designed to be fast, scalable, durable, and fault-tolerant. Kafka maintains feeds of messages called topics that can be published to by producers and subscribed to by consumers. A Kafka cluster typically runs on multiple servers called brokers that store topics which may be partitioned and replicated for fault tolerance. Producers publish messages to topics which are distributed to consumers through consumer groups that balance load.
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Apache Kafka is a distributed publish-subscribe messaging system that can handle high volumes of data and enable messages to be passed from one endpoint to another. It uses a distributed commit log that allows messages to be persisted on disk for durability. Kafka is fast, scalable, fault-tolerant, and guarantees zero data loss. It is used by companies like LinkedIn, Twitter, and Netflix to handle high volumes of real-time data and streaming workloads.
Haitao Zhang, Uber, Software Engineer + Yang Yang, Uber, Senior Software Engineer
Kafka Consumer Proxy is a forwarding proxy that consumes messages from Kafka and dispatches them to a user registered gRPC service endpoint. With Kafka Consumer Proxy, the experience of consuming messages from Apache Kafka for pub-sub use cases is as seamless and user-friendly as receiving (g)RPC requests. In this talk, we will share (1) the motivation for building this service, (2) the high-level architecture, (3) the mechanisms we designed to achieve high availability, scalability, and reliability, and (4) the current adoption status.
https://www.meetup.com/KafkaBayArea/events/273834934/
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages - producers and consumers of information.
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
Presented at the inaugural Kafka summit (2016) hosted by Confluent in San Francisco
Abstract:
Kafka is a backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. 2015 was an exciting year at LinkedIn in that we hit a new level of scale with Kafka: we now process more than 1 trillion published messages per day across nearly 1300 brokers. We run into some interesting production issues at this scale and I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year:
Data loss: We have extremely stringent SLAs on latency and completeness that were violated on a few occasions. Some of these incidents were due to subtle configuration problems or even missing features.
Offset resets: As of early 2015, Kafka-based offset management was still a relatively new feature and we occasionally hit offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management/log compaction as well as our monitoring.
Cluster unavailability due to high request/response latencies: Such incidents demonstrate how even subtle performance regressions and monitoring gaps can lead to an eventual cluster meltdown.
Power failures! What happens when an entire data center goes down? We experienced this first hand and it was not so pretty.
and more…
This talk will go over how we detected, investigated and remediated each of these issues and summarize some of the features in Kafka that we are working on that will help eliminate or mitigate such incidents in the future.
The art of the event streaming application: streams, stream processors and sc...confluent
The document discusses event streaming applications and microservices. It introduces event streaming as an architectural style where applications are composed of loosely coupled services that communicate asynchronously through streams of events. Key aspects covered include handling state using event streams and Kafka Streams, building applications as bounded contexts with choreography and orchestration, and establishing pillars for instrumentation, control and operations. Overall the document promotes event streaming as a paradigm that addresses complexity by providing simplicity and scalability through convergent data and logic processing.
Kafka summit SF 2019 - the art of the event-streaming appNeil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for Scalable payment processing Run it on rails: Instrumentation and monitoring Control flow patterns (start, stop, pause) Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and most importantly, how it all fits together at scale.
Cruise Control: Effortless management of Kafka clustersPrateek Maheshwari
Kafka has become the de facto standard for streaming data with high-throughput, low-latency, and fault-tolerance. However, its rising adoption raises new challenges. In particular, the growing cluster sizes, increasing volume and diversity of user traffic, and aging network and server components induce an overhead in managing the system. This overhead makes it infeasible for human operators to constantly monitor, identify, and mitigate issues. The resulting utilization imbalance across brokers leads to unpredictable client performance due to the high variation in their throughput and latency. Finally, properly expanding, shrinking, or upgrading clusters also incurs a management overhead. Hence, adopting a principled approach to manage Kafka clusters is integral to the sustainability of the infrastructure.
This talk will describe how LinkedIn alleviates the management overhead of large-scale Kafka clusters using Cruise Control. To this end, first, we will discuss the reactive and proactive techniques that Cruise Control uses to support admin operations for cluster maintenance, enable anomaly detection with self-healing, and provide real-time monitoring for Kafka clusters. Next, we will examine how Cruise Control performs in production. Finally, we will conclude with questions and further discussion.
The AZ-104 exam dumps of DumpsBase coming with actual questions and answers contain everything you need to know for attending the actual AZ-104 Microsoft Azure Administrator exam. These verified Microsoft AZ-104 exam dumps that you can use to increase your preparation for the actual AZ-104 exam. These AZ-104 test questions will help you pass the exam on your first try easily and comfortably. We're talking about AZ-104 exam dumps (V26.02) in PDF format and free software that you can download in preparation for the Microsoft AZ-104 exam. #AZ-104 Dumps
The document describes the BAT Buffer Pool (BBP) in MonetDB. The BBP manages BATs (Binary Association Tables) in memory and persists them to disk. It handles administration, lookup, persistence, buffer management, recovery, unloading, reference counting, sharing, and synchronization of BATs. The BBP stores BAT metadata in a BBP.dir file and uses locks to control concurrent access to the in-memory BBP array.
VMworld 2013: Part 2: How to Build a Self-Healing Data Center with vCenter Or...VMworld
VMworld 2013
Nicholas Colyer, Catamaran RX
Dan Mitchell, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The document discusses best practices for using Apache Cassandra, including:
- Topology considerations like replication strategies and snitches
- Booting new datacenters and replacing nodes
- Security techniques like authentication, authorization, and SSL encryption
- Using prepared statements for efficiency
- Asynchronous execution for request pipelining
- Batch statements and their appropriate uses
- Improving performance through techniques like the new row cache
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required different systems for each task -- stream processing engines, messaging queuing middleware, and pub/sub messaging systems. This has led to the unnecessary complexity for the development of such applications and operations leading to increased barrier to adoption in the enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system and make it easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next generation distributed pub-sub system that was originally developed and deployed at Yahoo and running in production in more than 100+ companies. Karthik will explain how the architecture and design of Pulsar provides the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will provide real life use cases how Apache Pulsar is used for event processing ranging from data processing tasks to web processing applications.
Behind the Code 'September 2022 // by ExnessMaxim Gaponov
The presentation discussed the evolution of hacks from 2009 to the present. Key hacks included the RockYou breach in 2009, Stuxnet in 2010, Shamoon in 2012, emergence of IoT hacks in 2014, Visa-Mastercard breach in 2015, WannaCry in 2017, Bangladesh Bank heist in 2016, and the FireEye hack in 2021. The presentation concluded that people will continue to be the main source of attacks, defenses need to move from reactive to resilient approaches, cloud security is important, and more interconnections will create new vulnerabilities as technology evolves.
This document discusses microservices architecture and related technologies like Docker and Kubernetes. It provides an overview of how microservices break up monolithic applications into independent, scalable services. It also describes how Docker containerizes services and how Kubernetes provides automation, scaling, and management of containerized services running across a cluster. Logging and monitoring services like ELK/EFK are also summarized as ways to aggregate and analyze logs from microservices.
Kafka is evolving to remove its dependency on Zookeeper. The Kafka Improvement Proposal 500 (KIP-500) aims to manage Kafka's metadata log with a self-managed Raft consensus algorithm and controller quorum rather than relying on Zookeeper. This will improve scalability, robustness, and make deployment easier. It will take multiple releases to fully implement KIP-500, beginning with removing Zookeeper from clients and ending with a release where Zookeeper is no longer required.
Architecting Microservices Applications with Instant Analyticsconfluent
View recording here: https://www.confluent.io/online-talks/architecting-microservices-applications-with-instant-analytics
The next generation architecture for exploring and visualizing event-driven data in real-time requires the right technology. Microservices deliver significant deployment and development agility, but raise questions of how data will move between services and how it will be analyzed. This online talk explores how Apache Druid and Apache Kafka® can turn a microservices ecosystem into a distributed real-time application with instant analytics. Apache Kafka and Druid form the backbone of an architecture that meet the demands imposed on the next generation applications you are building right now. Join industry experts Tim Berglund, Confluent, and Rachel Pedreschi, Imply, as they discuss architecting microservices apps with Druid and Apache Kafka.
Understanding and Extending Prometheus AlertManagerLee Calcote
The document discusses Prometheus AlertManager, including its purpose of ingesting, grouping, deduplicating, silencing, throttling, and notifying alerts from Prometheus. It describes AlertManager's routes, receivers, silencers, inhibitors, and grouping functionality. It also covers high availability, the AlertManager UI, and enhancing AlertManager to support viewing alert history.
ContainerDays Boston 2016: "Autopilot: Running Real-world Applications in Con...DynamicInfraDays
Slides from Tim Gross's talk "Autopilot: Running Real-world Applications in Containers" at ContainerDays Boston 2016: http://dynamicinfradays.org/events/2016-boston/programme.html#autopilot
The document provides information about IBM's Vulnerability Advisor tool for analyzing container images and instances for security vulnerabilities and policy violations. It discusses how the tool provides deep visibility into images and instances by collecting various data types and using annotators to analyze the data and provide operational insights. It also describes how the tool can help users identify vulnerable or non-compliant images, detect systems with weak passwords or password access configurations, and provide a vulnerability report with details on discovered issues and policy violations.
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld
VMworld 2013
Darryl Hing, VMware Canada
Jacy Townsend, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
VMworld 2013: Troubleshooting at Cox Communications with VMware vCenter Log I...VMworld
VMworld 2013
Chris Nakagaki, Cox Communications
Jason Davis, Cox Communications
Himanshu Kumar Singh, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The document summarizes research on assessing the scalability of microservice architectures. It discusses how microservices introduce challenges for monitoring performance and reliability due to their decentralized nature. The researcher aims to develop approaches to identify bottlenecks, anomalies, and anti-patterns in microservices. The document outlines a framework called PPTAM that generates load tests to analyze the performance of different architectural configurations and identifies the most scalable option based on success rates under various workloads. Ongoing work also looks to recognize common anti-patterns that can degrade microservice performance.
This presentation recounts the story of Macys.com and Bloomingdales.com's migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
This session will cover:
1) The process that led to our decision to use Cassandra
2) The approach we used for migrating from DB2 & Coherence to Cassandra without disrupting the production environment
3) The various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks, as well as how these performance results figured into our final schema designs.
4) Our lessons learned and next steps
Similar to Consumer offset management in Kafka (20)
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
4. (Don’t) store offsets in ZooKeeper
• Heavy write-load on ZooKeeper
• Especially an issue
– during 0.7 to 0.8 migration
– and before we switched to SSDs
• Non-ideal work-arounds
– Increase offset-commit intervals
– Filter commits if offsets have not moved
– Spread large offset commits over commit
interval
5. Offset management (ideals)
• Durable
• Support high write-load
• Consistent reads
• Atomic offset commits
• Fast commits/fetches
6. Store offsets in a replicated log
audit-consumer
PageViewEvent-0
240
audit-consumer
EmailBounceEvent-0
232
__consumer_offsets Next commit
Group
Offset
Partition
7. Store offsets in a replicated log
audit-consumer
PageViewEvent-0
240
audit-consumer
EmailBounceEvent-0
232
__consumer_offsets
audit-consumer
EmailBounceEvent-0
248
8. Store offsets in a replicated log
audit-consumer
PageViewEvent-0
240
audit-consumer
EmailBounceEvent-0
232
__consumer_offsets
audit-consumer
EmailBounceEvent-0
248
audit-consumer
PageViewEvent-0
323
9. Store offsets in a replicated log
audit-consumer
PageViewEvent-0
240
audit-consumer
EmailBounceEvent-0
232
__consumer_offsets
audit-consumer
EmailBounceEvent-0
248
audit-consumer
PageViewEvent-0
323
mirrormaker
ClickEvent-0
54543
29. Offset{Commit,Fetch} API
OffsetFetchRequest
o Group Id: String
o Partitions: List<Topic-partition>
OffsetFetchResponse
o Response map
§ Key è Topic-partition
§ Value è Partition-data
• Offset: Long
• Metadata: String
• Error code: Short
31. Offset{Commit,Fetch} API
KafkaConsumer<K, V> consumer = new KafkaConsumer<K, V>(properties);!
…!
TopicPartition partition1 = new TopicPartition("topic1", 0);!
TopicPartition partition1 = new TopicPartition("topic1", 1);!
!
consumer.subscribe(partition1, partition2);!
!
Map<TopicPartition, Long> offsets = new LinkedHashMap<TopicPartition,
Long>();!
offsets.put(partition1, 123L);!
offsets.put(partition2, 4320L);!
…!
// commit offsets!
consumer.commit(offsets, CommitType.SYNC);!
…!
// fetch offsets!
long committedOffset = consumer.committed(partition1);!
!
32. How to read the offsets topic
To read everything, use the console consumer!
./bin/kafka-console-consumer.sh --topic __consumer_offsets --
zookeeper localhost:2181 --formatter "kafka.server.OffsetManager
$OffsetsMessageFormatter" --consumer.config config/
consumer.properties!
(Must set exclude.internal.topics = false in consumer.properties)
!
To read a single partition, use the simple-
consumer-shell
./bin/kafka-simple-consumer-shell.sh --topic __consumer_offsets --
partition 12 --broker-list localhost:9092 --formatter
"kafka.server.OffsetManager$OffsetsMessageFormatter"!
34. How to migrate/roll-back
Migrate from ZooKeeper to Kafka:
• Config change
– offsets.storage=kafka
– dual.commit.enabled=true
• Rolling bounce
• Config change
– dual.commit.enabled=false
• Rolling bounce
35. How to migrate/roll-back
Migrate from Kafka to ZooKeeper:
• Config change
– dual.commit.enabled=true
• Rolling bounce
• Config change
– offsets.storage=zookeeper
– dual.commit.enabled=false
• Rolling bounce
36. Key metrics to monitor
• Consumer mbeans
– Kafka commit rate
– ZooKeeper commit rate (during migration)
• Broker mbeans
– Max-dirty ratio and other log cleaner metrics
– Offset cache size
– Group count
– {ConsumerMetadata, OffsetCommit, OffsetFetch}
request metrics
37. 0.8.3
• Support compression in compacted topics
(KAFKA-1734)
• Change offset commit “timestamp” to
mean retention period: KAFKA-1634
• Offset client