This document provides an overview of Apache Kafka. It begins with defining Kafka as a distributed streaming platform and messaging system. It then lists the agenda which includes what Kafka is, why it is used, common use cases, major companies that use it, how it achieves high performance, and core concepts. Core concepts explained include topics, partitions, brokers, replication, leaders, and producers and consumers. The document also provides examples to illustrate these concepts.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
Apache Kafka is a distributed publish-subscribe messaging system that allows for high throughput, low latency data ingestion and distribution. It provides reliability through replication, scalability by partitioning topics across brokers, and durability by persisting messages to disk. Common uses of Kafka include metrics collection, log aggregation, and stream processing using frameworks like Spark Streaming. Kafka's architecture includes brokers that store topics which are partitions distributed across a cluster, with ZooKeeper for coordination. Producers write messages to topics and consumers read messages in a subscriber model.
Apache Kafka Fundamentals for Architects, Admins and Developers, by confluent
This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.
Apache Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming apps. Producers write data to topics which are committed to disks across partitions and replicated for fault tolerance. Consumers read data from topics in a decoupled manner based on offsets. Kafka can process streaming data in real-time and at large volumes with low latency and high throughput.
Kafka is a distributed publish-subscribe messaging system that allows both streaming and storage of data feeds. It is designed to be fast, scalable, durable, and fault-tolerant. Kafka maintains feeds of messages called topics that can be published to by producers and subscribed to by consumers. A Kafka cluster typically runs on multiple servers called brokers that store topics which may be partitioned and replicated for fault tolerance. Producers publish messages to topics which are distributed to consumers through consumer groups that balance load.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Kafka is an open-source message broker that provides high-throughput and low-latency data processing. It uses a distributed commit log to store messages in categories called topics. Processes that publish messages are producers, while processes that subscribe to topics are consumers. Consumers can belong to consumer groups for parallel processing. Kafka guarantees order and no lost messages. It uses Zookeeper for metadata and coordination.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
Apache Kafka is a fast, scalable, durable and distributed messaging system. It is designed for high throughput systems and can replace traditional message brokers. Kafka has better throughput, partitioning, replication and fault tolerance compared to other messaging systems, making it suitable for large-scale applications. Kafka persists all data to disk for reliability and uses distributed commit logs for durability.
A brief introduction to Apache Kafka and describe its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that will help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is becoming the message bus to transfer huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic-partition counts, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control on who can read from and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs and monitor consumer offsets.
Kafka Tutorial - Introduction to Apache Kafka (Part 1), by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through some simple Java client examples for a Kafka producer and a Kafka consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka producers.
Producer Performance Tuning for Apache Kafka, by Jiangjie Qin
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
This document provides an introduction to Apache Kafka. It describes Kafka as a distributed messaging system with features like durability, scalability, publish-subscribe capabilities, and ordering. It discusses key Kafka concepts like producers, consumers, topics, partitions and brokers. It also summarizes use cases for Kafka and how to implement producers and consumers in code. Finally, it briefly outlines related tools like Kafka Connect and Kafka Streams that build upon the Kafka platform.
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree, by Slim Baltagi
Kafka as a streaming data platform is becoming the successor to traditional messaging systems such as RabbitMQ. Nevertheless, there are still some use cases where they could be a good fit. This one single slide tries to answer in a concise and unbiased way where to use Apache Kafka and where to use RabbitMQ. Your comments and feedback are much appreciated.
Hello, Kafka! (An Introduction to Apache Kafka), by Timothy Spann
Hello Apache Kafka
An introduction to Apache Kafka with Timothy Spann and Carolyn Duby, Cloudera Principal Engineers.
We also demo Flink SQL, SMM, SSB, Schema Registry, Apache Kafka, Apache NiFi and Public Cloud - AWS.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale: its caveats, its guarantees and the use cases it offers.
How we use it @ZaprMediaLabs.
Apache Kafka is a distributed publish-subscribe messaging system that allows for high-throughput, persistent storage of messages. It provides decoupling of data pipelines by allowing producers to write messages to topics that can then be read from by multiple consumer applications in a scalable, fault-tolerant way. Key aspects of Kafka include topics for categorizing messages, partitions for scaling and parallelism, replication for redundancy, and producers and consumers for writing and reading messages.
This document provides an introduction to Apache Kafka, an open-source distributed event streaming platform. It discusses Kafka's history as a project originally developed by LinkedIn, its use cases like messaging, activity tracking and stream processing. It describes key Kafka concepts like topics, partitions, offsets, replicas, brokers and producers/consumers. It also gives examples of how companies like Netflix, Uber and LinkedIn use Kafka in their applications and provides a comparison to Apache Spark.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across clusters of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages - producers and consumers of information.
ksqlDB is a stream processing SQL engine that allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing them in near real time with a SQL-like language, and producing results back to a Kafka topic. Not a single line of Java code has to be written, and you can reuse your SQL know-how. This significantly lowers the bar for getting started with stream processing.
ksqlDB offers powerful stream processing capabilities, such as joins, aggregations, time windows and support for event time. In this talk I will present how ksqlDB integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo on a fictitious IoT sample.
Full recorded presentation at https://www.youtube.com/watch?v=2UfAgCSKPZo for Tetrate Tech Talks on 2022/05/13.
Envoy's support for the Kafka protocol, in the form of the broker-filter and the mesh-filter.
Contents:
- overview of Kafka (use cases, partitioning, producer/consumer, protocol);
- proxying Kafka (non-Envoy specific);
- proxying Kafka with Envoy;
- handling Kafka protocol in Envoy;
- Kafka-broker-filter for per-connection proxying;
- Kafka-mesh-filter to provide front proxy for multiple Kafka clusters.
References:
- https://adam-kotwasinski.medium.com/deploying-envoy-and-kafka-8aa7513ec0a0
- https://adam-kotwasinski.medium.com/kafka-mesh-filter-in-envoy-a70b3aefcdef
Apache Kafka is a distributed publish-subscribe messaging system that can handle high volumes of data and enable messages to be passed from one endpoint to another. It uses a distributed commit log that allows messages to be persisted on disk for durability. Kafka is fast, scalable, fault-tolerant, and guarantees zero data loss. It is used by companies like LinkedIn, Twitter, and Netflix to handle high volumes of real-time data and streaming workloads.
Building Zero Data Loss Pipelines with Apache Kafka, by Avinash Ramineni
Kafka is playing an increasingly important role in messaging and streaming systems and is becoming the de facto messaging platform in many enterprises. Managing and maintaining Kafka deployments and tuning the data pipelines for high performance and scalability can become a challenging task.
In this session, we will discuss the lessons learned and the best practices for achieving zero data loss pipelines.
This document provides an overview of Apache Kafka. It describes Kafka as a distributed publish-subscribe messaging system with a distributed commit log that provides high-throughput and low-latency processing of streaming data. The document covers Kafka concepts like topics, partitions, producers, consumers, replication, and reliability guarantees. It also discusses Kafka architecture, performance optimizations, configuration parameters for durability and reliability, and use cases for activity tracking, messaging, metrics, and stream processing.
Fundamentals and Architecture of Apache Kafka, by Angelo Cesaro
This presentation explains Apache Kafka's architecture and internal design giving an overview of Kafka internal functions, including:
brokers, replication, partitions, producers, consumers, the commit log, and a comparison with traditional message queues.
This document discusses the evolution of Kafka clusters at AppsFlyer over time. The initial cluster had 4 brokers and handled hundreds of millions of messages with low partitioning and replication. A new cluster was designed with more brokers, replication across availability zones, and higher partitioning to support billions of messages. However, this led to issues like uneven leader distribution and failures. Various solutions were implemented like increasing brokers, splitting topics, and hardware upgrades. Ongoing testing and monitoring helped identify more problems and improvements around replication, partitioning, and automation. Key lessons learned included balancing replication and leaders, supporting dynamic changes, and thorough testing of failure scenarios.
This session goes through an understanding of Apache Kafka and its components, working through best practices to achieve a fault-tolerant system with high availability and consistency by tuning Kafka brokers and producers for the best results.
Apache Kafka is a distributed streaming platform that allows for publishing and subscribing to streams of records. It uses a broker system and partitions topics to allow for scaling and parallelism. LinkedIn's Camus is a MapReduce job that moves data from Kafka to HDFS in distributed fashion. It consists of three stages: setup, the MapReduce job, and cleanup.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015, by Monal Daxini
Keystone - Processing over Half a Trillion events per day with 8 million events & 17 GB per second peaks, and at-least once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka based pipeline and support Spark Streaming.
Presentation from the Kafka meetup on 13-SEP-2013, including some notes to clarify some slides. Enjoy!
Avi Levi
123avi@gmail.com
https://www.linkedin.com/in/leviavi/
This document provides an introduction and overview of Apache Kafka. It discusses Kafka's core concepts including producers, consumers, topics, partitions and brokers. It also covers how to install and run Kafka, producer and consumer configuration settings, and how data is distributed in a Kafka cluster. Examples of creating topics, producing and consuming messages are also included.
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Apache Kafka's Common Pitfalls & Intricacies: A Customer Support Perspective, by HostedbyConfluent
"As Apache Kafka gains widespread adoption, an increasing number of people face its pitfalls. Despite completing courses and reading documentation, many encounter hurdles navigating Kafka's subtle complexities.
Join us for an enlightening session led by the customer support team of Conduktor, where we engage daily with users grappling with Kafka's subtleties. We've observed recurring themes in user queries: What happens when a consumer group rebalances? What is an advertised listener? Why aren't my records displayed in chronological order when I consume them? How does retention work?
For all these questions, the answer is ""It depends"". In this talk, we aim to demystify these uncertainties by presenting nuanced scenarios for each query. That way you will be more confident on how your Kafka infrastructure works behind the scenes, and you'll be equipped to share this knowledge with your colleagues. By being aware of the most common misconceptions, you should be able to both speed up your own learning curve and also help others more effectively."
Apache Kafka is a distributed messaging system that provides fast, highly scalable messaging through a publish-subscribe model. It was built at LinkedIn as a central hub for messaging between systems and focuses on scalability and fault tolerance. Kafka uses a distributed commit log architecture with topics that are partitioned for scalability and parallelism. It provides high throughput and fault tolerance through replication and an in-sync replica set.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This document provides an overview of structured streaming with Kafka in Spark. It discusses data collection vs ingestion and why they are key. It also covers Kafka architecture and terminology. It describes how Spark integrates with Kafka for streaming data sources. It explains checkpointing in structured streaming and using Kafka as a sink. The document discusses delivery semantics and how Spark supports exactly-once semantics with certain output stores. Finally, it outlines new features in Kafka for exactly-once guarantees and the future of structured streaming.
LinkedIn Stream Processing Meetup - Apache Pulsar, by Karthik Ramasamy
Apache Pulsar is a fast, highly scalable, and flexible pub/sub messaging system. It provides guaranteed message delivery, ordering, and durability by backing messages with a replicated log storage. Pulsar's architecture allows for independent scalability of brokers and storage nodes. It supports multi-tenancy, geo-replication, and high throughput of over 1.8 million messages per second in a single partition.
Timothy will introduce Apache Pulsar, an open-source distributed messaging and streaming platform. He will discuss how to build real-time applications using Pulsar with various libraries, schemas, languages, frameworks and tools. The presentation will cover what Pulsar is, its functions and components, how it compares to other technologies like Apache Kafka, its advantages, and how to integrate it with tools like Apache Flink, Apache Spark, Apache NiFi and more. A demo and Q&A will follow.
Uber has one of the largest Kafka deployments in the industry. To improve scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers and consumers. Users do not need to know which cluster a topic resides on; clients view a "logical cluster". The federation layer maps clients to the actual physical clusters and keeps the location of the physical cluster transparent to the user. Cluster federation brings several benefits that support our business growth and ease our daily operation.
Client control: inside Uber there is a large number of applications and clients on Kafka, and it is challenging to migrate a topic with live consumers between clusters. Coordination with the users is usually needed to shift their traffic to the migrated cluster. Cluster federation enables much more control of the clients from the server side, by redirecting consumer traffic to another physical cluster without restarting the application.
Scalability: with federation, the Kafka service can horizontally scale by adding more clusters when a cluster is full. Topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, from the user perspective they view only one logical cluster.
Availability: with a topic replicated to at least two clusters, we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region failover. This also provides much freedom and alleviates the risks of carrying out important maintenance on a critical cluster: before the maintenance, we mark the cluster as secondary and migrate off the live traffic and consumers.
We will present the details of the architecture and several interesting technical challenges we overcame.
This document provides an introduction to Apache Kafka. It discusses why Kafka is needed for real-time streaming data processing and real-time analytics. It also outlines some of Kafka's key features like scalability, reliability, replication, and fault tolerance. The document summarizes common use cases for Kafka and examples of large companies that use it. Finally, it describes Kafka's core architecture including topics, partitions, producers, consumers, and how it integrates with Zookeeper.
1. Apache Kafka
2. Agenda
1. What is Kafka?
2. Why Kafka?
3. Kafka Use Cases
4. Who Uses Kafka?
5. Why is Kafka So Fast?
6. Kafka Core Concepts (Theory)
7. Kafka CLI 101
3. What is Kafka?
At the beginning …
“... a publish/subscribe messaging system ...”
4. What is Kafka?
... today …
“... a stream data platform ...”
5. What is Kafka?
... but at the core …
“... a distributed, horizontally-scalable, fault-tolerant ...”
6. What is Kafka?
● Developed at LinkedIn back in 2010, open-sourced in 2011
● Designed to be fast, scalable, durable and available
● Used to decouple data streams and systems
● Distributed by nature (cluster)
● Resilient architecture
● Fault tolerant
● High throughput / low latency
● Able to handle a huge number of consumers
7. Why Kafka?
● Great performance (low latency, < 10 ms)
● Horizontally scalable (can add more nodes to the cluster)
● Fault-tolerant storage
○ Replicates topic log partitions to multiple servers
● Stable, reliable, durable
● Robust replication (no data loss)
10. Kafka Use Cases
● Messaging
○ As a “traditional” messaging system
● Website activity tracking
○ Events like page views and searches
● Metrics collection and monitoring
○ Alerting and reporting on operational metrics
● Log aggregation
○ Collecting logs from multiple services
● Stream processing
○ Reading, processing and writing streams for real-time analysis
11. Who Uses Kafka?
● LinkedIn uses Kafka to monitor activity data and operational metrics
● Uber uses Kafka to gather user, taxi and trip data in real time to compute and forecast surge pricing in real time
● Netflix uses Kafka to serve recommendations in real time while you’re watching a TV show
12. Why is Kafka So Fast?
● Zero copy - calls the OS kernel directly to move data without copying it through the application
● Batches data in chunks (see the producer sketch after this list)
○ End to end, from producer to file system to consumer, which minimises cross-machine latency
○ Enables more efficient data compression, reducing I/O latency
● Sequential disk writes - avoids random disk access
○ Writes to an immutable commit log: no slow disk seeks, no random I/O operations
○ The disk is accessed in a sequential manner
● Horizontal scale - uses hundreds to thousands of partitions for a single topic
○ Spread out across thousands of servers
○ Handles massive load
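As a rough illustration of the batching and compression knobs mentioned above, here is a minimal Java producer sketch. The broker address and topic name are placeholders for a local test setup; batch.size, linger.ms and compression.type are standard producer settings, but the values shown are illustrative, not recommendations.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("batch.size", 65536);       // accumulate up to 64 KB per partition before sending
            props.put("linger.ms", 10);           // wait up to 10 ms so batches can fill
            props.put("compression.type", "lz4"); // compress whole batches, cutting I/O and bandwidth
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100; i++) {
                    // records headed for the same partition are grouped into one compressed batch
                    producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "event-" + i));
                }
            }
        }
    }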
14. Kafka Core Concept - Topic and Partitions
Topic:
● Similar to a table in a database (but without reference IDs)
● Each topic is identified by its name (unique key)
Partitions:
● A topic is split into partitions, and each partition is ordered
● Each message in a partition is assigned a sequential id called an offset
○ Starts from zero and increases to 1, 2, 3, ... and so on
● Ordering is only guaranteed within a partition of a topic
● Once data is written to a partition, it cannot be changed (immutability)
● Data is retained for a configurable period of time (the default is 7 days)
15. Kafka Core Concept - Topic and Partitions
For example, one topic (Topic A) with 3 partitions:
Partition 0: offsets 0 1 2 3 4 5
Partition 1: offsets 0 1 2 3
Partition 2: offsets 0 1 2 3 4 5 6 7
Writes are appended at the “new” end of each partition; the oldest records sit at offset 0.
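A sketch of how a topic like this could be created programmatically with the Java AdminClient (the kafka-topics command-line tool does the same job); the topic name, partition count and the replication factor of 1 are assumptions for a single-broker sandbox:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicA {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // "topic-a" with 3 partitions, replication factor 1 (fine for a local sandbox)
                NewTopic topic = new NewTopic("topic-a", 3, (short) 1);
                admin.createTopics(List.of(topic)).all().get(); // block until the broker confirms
            }
        }
    }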
16. Kafka Core Concept - Kafka Brokers
Kafka Brokers:
● A broker is a Kafka server that holds partitions of topics
● Each broker has an ID (a number)
● A Kafka cluster is composed of multiple brokers (servers)
● A topic consists of partitions that can be spread across multiple nodes of the cluster
● Connecting to one broker bootstraps the client to the entire cluster (bootstrap server)
● Start with at least 3 brokers; a cluster can grow to 10, 100, or 1,000 brokers if needed
19. Kafka Core Concept - Kafka Replication
Kafka replication factor, failover, ISR
● Kafka replicates topic partitions
○ Across multiple nodes in the cluster, for failover
● For a topic with replication factor N, Kafka can tolerate up to N - 1 server failures without losing data
○ For example, with 3 brokers and a replication factor of 3, up to 3 - 1 = 2 brokers can fail and your data is still not lost
○ The replication factor determines how many brokers each partition is replicated to
20. Kafka Core Concept - Kafka Replication
For example, in a Kafka cluster, Topic A with 2 partitions and a replication factor of 2:
Broker 1: Topic A, Partition 0
Broker 2: Topic A, Partition 1, plus a replica of Partition 0
Broker 3: a replica of Topic A, Partition 1
Each partition is replicated to one other broker.
22. Kafka Core Concept - Leader for Partition
Leader for Partition
● Each partition in a topic has 1 leader and 0 or more replicas
● At any time, only one broker can be the leader for a given partition
● Only the leader can receive and serve data for a partition
● The other brokers synchronize the data (followers)
○ The group of in-sync replicas for a partition is called the ISR (in-sync replicas)
● Therefore each partition is going to have one leader and multiple in-sync replicas
● Kafka replication exists for failover
○ If one broker goes down, another broker (in the ISR) can serve the data
23. Kafka Core Concept - Leader in Partition
Continuing the Topic A example: Broker 1 holds Partition 0 (leader); Broker 2 holds Partition 1 (leader) and a follower copy of Partition 0 (in the ISR); Broker 3 holds a follower copy of Partition 1 (in the ISR). Followers replicate from their leaders.
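To make leaders and ISRs visible, here is a small AdminClient sketch that describes a topic and prints the leader broker and the in-sync replicas for each partition. It assumes the placeholder topic-a and local broker from the earlier sketches:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ShowLeadersAndIsr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc =
                    admin.describeTopics(List.of("topic-a")).all().get().get("topic-a");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // leader() is the broker currently serving reads/writes for this partition;
                    // isr() lists the replicas that are fully caught up with the leader
                    System.out.printf("partition %d: leader=%s isr=%s%n",
                                      p.partition(), p.leader(), p.isr());
                }
            }
        }
    }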
27. Kafka Core Concept - Producers
Producers
● Producers write data to topics (which are made of partitions)
● The load is balanced across many brokers
For example, a producer sending data to Topic A writes to Partition 0 on Broker 1 and to Partition 1 on Broker 2 (whose replica lives on Broker 3); each write is appended at the next offset of its partition.
28. Kafka Core Concept - Producers
Durable Writes
● Durability can be configured with the producer configuration (see the sketch after this list)
○ acks=0 : the producer never waits for an ack (possible data loss)
○ acks=1 : the producer gets an ack after the leader has received the data (limited data loss)
○ acks=all : the producer gets an ack after all ISRs receive the data (no data loss)
● Producers can trade off between throughput and durability of writes
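A minimal sketch of a durable write with the Java producer, against the same placeholder broker and topic. acks=all is the slide's "no data loss" setting, and blocking on the returned Future trades throughput for the durability guarantee:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class DurableProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all"); // ack only after all in-sync replicas have the record
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // block until the broker acknowledges; get() throws if the write failed
                RecordMetadata md =
                    producer.send(new ProducerRecord<>("topic-a", "key", "value")).get();
                // the broker reports where the record landed
                System.out.printf("acked at partition=%d offset=%d%n", md.partition(), md.offset());
            }
        }
    }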
29. Kafka Core Concept - Consumer
Consumer
● Consumers read data from a topic
● Data is read in order within each partition
● Messages stay in Kafka; they are not removed after they are consumed
For example, a consumer reads a, e, i, k in order from one partition, c, g from another, and b, d, f, h, j, l, m, n from a third.
30. Kafka Core Concept - Consumer Groups
Consumer Groups
● Consumers can be organised into consumer groups (see the sketch below)
● If you have more consumers than partitions, some consumers will be inactive
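A sketch of a consumer joining a group, again with placeholder broker, topic and group names. Starting several copies of this program with the same group.id splits topic-a's partitions among them; with 3 partitions, a fourth copy would sit idle:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "demo-group"); // instances sharing this id share the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("topic-a"));
                while (true) {
                    // poll returns whatever records are available on this instance's partitions
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          r.partition(), r.offset(), r.value());
                    }
                }
            }
        }
    }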
31. Kafka Core Concept - Consumer Offsets
Consumer Offsets
● Kafka stores the offsets at which a consumer group has been reading (checkpointing / bookmarking)
● The committed offsets are stored in a Kafka topic named “__consumer_offsets”
● When a consumer in a group has processed data received from Kafka, it should commit the offsets (to “__consumer_offsets”)
● If a consumer dies, it will be able to read back from where it left off
32. Kafka Core Concept - Message Delivery Semantics
Delivery Semantics for Consumers
● Consumers choose when to commit offsets (compare the commit placement in the sketch below)
● There are 3 delivery semantics
○ At most once:
■ Read message, commit offset, process message
■ Messages may be lost but are never redelivered
○ At least once: (usually preferred)
■ Read message, process message, commit offset
■ Messages are never lost but may be redelivered
○ Exactly once:
■ Each message is delivered once and only once
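The difference between at-most-once and at-least-once comes down to where the commit sits relative to processing. A hedged sketch with manual commits (enable.auto.commit=false); the process() helper is hypothetical business logic, and broker, topic and group names are the same placeholders as before:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "demo-group");
            props.put("enable.auto.commit", "false"); // we decide when offsets are committed
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("topic-a"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    // At-most-once would call consumer.commitSync() HERE, before processing:
                    // a crash during processing would then lose those records.
                    for (ConsumerRecord<String, String> r : records) {
                        process(r); // a crash here means uncommitted records get redelivered
                    }
                    consumer.commitSync(); // at-least-once: commit only after processing
                }
            }
        }

        static void process(ConsumerRecord<String, String> r) { // hypothetical business logic
            System.out.println(r.value());
        }
    }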
33. Kafka Core Concept - Zookeeper
Zookeeper
● Manages brokers (keeps a list of them)
● Helps with leader election for Kafka brokers and topic-partition pairs
● Manages service discovery for the Kafka brokers that form the cluster
● Sends notifications to Kafka in case of changes
○ a new broker joins,
○ a broker dies,
○ a topic is removed,
○ a topic is added, etc.
● Kafka cannot work without Zookeeper
34. Kafka CLI - 101
● Kafka Topics CLI
● Kafka Console Producer CLI
● Kafka Console Consumer CLI
● Kafka Consumer Groups CLI
● Resetting Offsets
● CLI Options that are good to know
● Let’s play with Kafka in action
Hi, good morning everyone. Today I will present Apache Kafka.
We will start with the first topic.
We will focus on the last two topics.
LinkedIn had around 300 million users generating events every day, which sometimes led to data-loss problems.
That is how Kafka came about: it was designed to handle data at scale, to guarantee that data reaches its consumers, and to run as a distributed system.
Storage is distributed across the cluster.
Resilient architecture: for example, we can add or remove consumers at any time and Kafka will rebalance the load.
Fault tolerant: the data stays durable.
Horizontally scalable: more machines (nodes) can be added to the cluster.
Fast: latency is under 10 ms.
Fault tolerant - parts of the data are backed up on several different servers in the cluster.
Robust, because records written to a Kafka server are persisted to disk and replicated to other servers.
This is where Kafka comes into the picture.
Kafka decouples/separates data from systems.
Kafka is really good at moving your data, because Kafka is really fast.
Messaging - message queue
We use Kafka for messaging more than other tools, probably because of its replication, built-in partitioning and fault tolerance compared with traditional messaging systems such as RabbitMQ.
Kafka is often used instead of message queues such as RabbitMQ because of its high throughput, reliability, replication and fault tolerance.
Stream processing
Because Kafka is a real-time publish-subscribe messaging system, people usually use Kafka for real-time processing and monitoring systems.
All these companies use Kafka so they can make real-time recommendations; real-time decisions give them real-time insight into their users.
Kafka compresses and batches your data to fit your bandwidth.
For example, suppose your network bandwidth is 10 MB but your data is 100 MB.
It is more efficient to send ten batches of 10 MB than 100 MB in one go, reducing I/O latency.
Sequential disk access is faster than random disk access.
As you can see, it’s not that different. But still, sequential memory access is faster than sequential disk access, so why not choose memory? Because Kafka runs on top of the JVM, which gives us two disadvantages.
The memory overhead of objects is very high, often doubling the size of the stored data (or even more).
Garbage collection happens every now and then, so creating objects in memory is very expensive: as in-heap data increases, more time is needed to collect unused data (garbage), and GC runs more frequently.
Kafka runs on top of the JVM; if we wrote data into memory directly, the memory overhead would be high and GC would happen frequently, so MMAP is used to avoid this issue.
MMAP maps the file contents from disk into memory.
Producers get data from source systems and send it to Kafka
Consumers consume data from Kafka and send it to target systems
Zookeeper manages which replica should be read from and where each consumer should continue reading
Zookeeper is used to manage the Kafka servers
Kafka client -> broker_1: establish connection + metadata request
broker_1 -> Kafka client: return a list of all brokers
Kafka client -> broker_3: the Kafka client can now connect to whichever broker it needs (see the sketch below)
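A small sketch of this bootstrap flow using Kafka's Java AdminClient: connect to one broker, request metadata, and discover the rest of the cluster. The single localhost:9092 address is an assumption:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class BootstrapSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One reachable broker is enough; the client learns about the others from metadata.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeCluster().nodes().get()
                 .forEach(node -> System.out.println("broker: " + node));
        }
    }
}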
We have to understand topics and partitions before we dive deeper into Kafka. Topics and partitions are how Kafka organizes messages.
Topic - topics are broken up into ordered commit logs called partitions.
Imagine Kafka holding a huge amount of data: how do we know which messages to get? We need a unique key to find them, and that key is the topic (you can think of a topic as a unique key in a database).
Partitions - as we know, a Kafka server/broker stores the data, but sometimes the data is so big that a single computer cannot hold it.
So the idea is to divide the data into partitions and keep them on different machines (distributed data).
Data already written to a topic cannot be modified (immutability).
Data is deleted after the retention time we configure (default 604800000 ms, i.e. 7 days), whether or not it has been consumed, to clear space in storage. Data in Kafka has a time limit.
Offset: a record in a partition has an offset associated with it. Think of it like this: a partition is like an array; offsets are like indexes.
Order is guaranteed only within a partition (not across partitions)
Offsets start from zero and increase 1, 2, 3, ... and so on
Look at partition 0 -> the latest offset is 4, so the next one will be offset 5
An offset just specifies a position within a partition
At first there is no value; when the first record arrives, its offset is 0, and the count keeps increasing in order as more records arrive
Records are ordered within their own partition, which means offset 0 of partition 0 may come before or after offset 0 of partition 1 (the seek sketch below shows offsets used as indexes)
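A hedged Java sketch of the "partition is an array, offset is an index" idea: assign partition 0 directly and seek to offset 5, matching the slide's example. The broker address and topic name reuse the CLI examples later in this deck:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("first_topic", 0);
            consumer.assign(Collections.singletonList(p0)); // read this partition directly
            consumer.seek(p0, 5);                           // jump to "index" 5 of the log
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}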
Bootstrap server - each broker knows about all brokers, all topics, and all partitions (metadata)
Topic 1 has 3 partitions
Topic 2 has 2 partitions
Data is broken up into partitions and distributed to different brokers/machines
When you create a topic, Kafka automatically assigns the topic and distributes it across all your brokers
Multiple brokers in a single group are called a Kafka cluster
Kafka is a distributed system
Replication means that if one broker goes down, things will keep working
Each topic replicates its partitions from the leader to some number of servers/brokers (called in-sync replication: when data reaches the leader, it is copied to the replicas right away). There should be more than one copy, usually 2-3, and each partition has exactly one leader. This property is what gives Apache Kafka its fault tolerance: if one broker dies, Kafka can still read and write data on a replica by promoting that replica to leader.
If any one broker dies, we still have two others to do the work, but if more than one dies we are out of luck. That is why choosing the replication factor matters: the higher it is, the safer you are. With a replication factor of n you can tolerate n-1 dead servers. The trade-off is that it costs storage space, so estimate it from your own risk.
You have two copies of each piece of data
If broker 2 goes down, topic A will not be lost
Replicas allow us to ensure that data will not be lost
Only the leader can receive and serve data for a partition
>> In other words, the leader is the replica that consumers/producers use
to read and write data in the partition when exchanging messages
The other brokers will synchronize the data
>> The other replicas are called followers; they don’t serve client requests
They replicate messages from the leader, forming the “in-sync replicas” (ISR)
Who decides the leader and followers?
Answer: Zookeeper
❖ Kafka Replication is for Failover
❖ Mirror Maker is used for Disaster Recovery
❖ Mirror Maker replicates a Kafka cluster to another data-center or AWS region
❖ It is called mirroring, rather than replication, because “replication” refers to copying within a single cluster
As I mentioned before, a single topic partition has one leader, and the others are followers
Broker 1 is the leader
Broker 2 and broker 3 are followers
What happens if broker 1 goes down?
Then broker 2 becomes the leader, because broker 2 was in the ISR
And then if broker 1 comes back,
it will try to become the leader again after replicating the data
Messages are appended to a topic-partition in the order they are sent
Basically, if a producer sends data without a key, the data is sent round-robin to broker 1, broker 2, broker 3
If the producer sends data with a key, Kafka hashes the key and uses the value to select a partition
Once a key goes to a partition, it goes there every time: a key always maps to the same partition
If we don’t set a key, messages are sent round-robin to the brokers hosting that topic’s partitions
If a key is set, then after the first produce, any later message with the same key is routed to the same broker/partition that key went to the first time
To decide which partition a brand-new key should live on, Kafka hashes the key and derives the partition from the hash value, as shown in the sketch below
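A minimal sketch of that key-to-partition mapping. Note the hedge: Kafka's default partitioner actually hashes the serialized key bytes with murmur2; Arrays.hashCode below is a stand-in to show the idea, not the real hash function:

import java.util.Arrays;

public class PartitionerSketch {
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative,
        // then map the hash onto one of the topic's partitions.
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes();
        // Same key + same partition count => same partition every time.
        System.out.println(partitionFor(key, 3));
        System.out.println(partitionFor(key, 3)); // identical output
    }
}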
Producers can choose to receive acknowledgement of data writes
Acknowledgement is a synonym for confirmation
There are 3 acknowledgement modes (see the producer sketch below):
acks=0 - just send the data, no acknowledgement
acks=1 - get an ack once only the leader has written the data (default)
acks=all - get an ack after all ISR (leader & replicas) have received the data
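A minimal Java producer sketch showing the acks setting, assuming a broker on localhost:9092 and the first_topic topic from the CLI examples later in this deck:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // "0" = fire and forget, "1" = leader only, "all" = wait for all in-sync replicas
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("first_topic", "key1", "hello"),
                (metadata, e) -> {
                    if (e == null) {
                        System.out.printf("acked: partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}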
Consumers read messages in the order stored in a topic-partition
Consumers
The main job of a consumer is to read data from partitions. It only needs to connect to some broker and specify a topic, and it will read the data regardless of which broker each partition lives on, just like on the producer side.
Order
To stress the reading order again: a consumer reads data in order within a single partition, but different partitions are read in parallel. So if data must be used in order, designing the message key for ordering from the start is very important.
If our system receives a lot of data from producers and we don’t have enough consumers, we just add more consumers.
We can add/remove consumers at any time, and Kafka will rebalance the load.
Consumer Groups
The problem: if our system produces a lot of data while the consumers we have are too few, we can simply add consumers so consumption speeds up, since they consume in parallel. One rule for consumers in a group: their number must not exceed the number of partitions in the topic the group is interested in; any extra consumers will sit idle.
From the same example, notice that no two consumers read the same partition, meaning that no matter how many consumers we add or remove, they will never read duplicate data.
Another great property: we can add and remove consumers at any time, and the group will re-balance by itself which consumer reads which partition. This is the resilience property.
You might wonder what happens with multiple groups (for different consumption purposes, e.g. a new service): where does each start reading within the same topic? The answer is that each group has its own counter (offset). If group 1 is running and we add group 2, group 2 starts reading from the beginning (configurable: from the beginning or from the current data); the counters are completely separate.
Two consumers in the same group cannot consume messages from the same partition at the same time. A consumer can consume from multiple partitions at the same time.
How does Kafka know which offset a consumer should read next?
The counter for the next offset of each consumer in a consumer group: committed offsets are stored in a topic named “__consumer_offsets” (in versions < 0.9, offsets were kept in Zookeeper).
“I died, and now I’m back alive. So now I can start at this offset and continue reading from there.”
At most once
Offsets are committed as soon as the message is received
If the processing goes wrong, the message will be lost (it won’t be read again)
At least once
Offsets are committed after the message is processed
You read the data, do something with it, and then commit the offset. If processing goes wrong, or your consumer goes down, the message will be read again (see the consumer sketch below).
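A minimal at-least-once consumer sketch in Java: auto-commit is disabled and offsets are committed only after processing, so a crash mid-batch means redelivery rather than loss. The broker address, topic, and group name reuse the CLI examples below; everything else is illustrative:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my_app");          // the consumer group
        props.put("enable.auto.commit", "false"); // we commit manually, after processing
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, nothing is committed and we re-read
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}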
- Manages the brokers: knows which broker is where, and whether it is dead or alive
- Records which topics exist or not, and how many partitions each topic has
- Elects the leader/replicas of each partition
- Sends a signal to Kafka on every change that happens, e.g. a new topic, a dead broker, or a new broker
- Records how much data each producer/consumer is allowed to write or read
- Stores authorization data: which users are allowed to create which topics
- Records how many consumers each consumer group has and, in old versions, which offset each has read up to
- Has its own quorum, usually an odd number of Zookeeper nodes (3, 5, 7, ...), because writes need consensus: more than half of the running Zookeeper nodes must confirm a write before the Zookeeper leader considers it truly committed
- Zookeeper has a leader (handles writes) and the rest of the servers are followers (handle reads)
- Consumers & producers don’t write to Zookeeper; they write to Kafka
- Kafka just manages all its metadata in Zookeeper
- Zookeeper does not store consumer offsets (in current versions)
Kafka requires Java version 8 (not 9, not 10)
Visualization tool -> Kafka Tool
Create topic
kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --create --replication-factor 1 --partitions 3
List topic
kafka-topics --zookeeper 127.0.0.1:2181 --list
Describe topic
kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --describe
Delete topic
kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --delete
Console Producer CLI - produce
kafka-console-producer --broker-list localhost:9092 --topic first_topic
Console Consumer CLI - consume
kafka-console-consumer --bootstrap-server localhost:9092 --topic first_topic
Console Consumer CLI - consume from beginning
kafka-console-consumer --bootstrap-server localhost:9092 --topic first_topic --from-beginning
Consumer Group
kafka-console-consumer --bootstrap-server localhost:9092 --topic topic_1 --group my_app
Consumer Group CLI – list
kafka-consumer-groups --bootstrap-server localhost:9092 --list
Consumer Group CLI – describe
kafka-consumer-groups --bootstrap-server localhost:9092 --group my_app --describe
Consumer Group CLI - reset offsets
kafka-consumer-groups --bootstrap-server localhost:9092 --group my_app --reset-offsets --to-earliest --topic topic_1 --execute
CLI Options
Produce topic with key
kafka-console-producer --broker-list localhost:9092 --topic first_topic --property parse.key=true --property key.separator=,
Consume with key
kafka-console-consumer --bootstrap-server localhost:9092 --topic first_topic --property print.key=true --property key.separator=,