This document provides an overview of Apache Kafka. It discusses Kafka's key capabilities: publishing and subscribing to streams of records, storing streams of records durably, and processing streams of records as they occur. It describes Kafka's core components, such as producers, consumers, brokers, and clustering. It also outlines why Kafka is useful for messaging, storing data, and processing streams in real time, and covers performance features such as support for multiple producers and consumers and disk-based retention.
2. Data integration and stream processing
● Streaming data includes a wide variety of data, such as ecommerce purchases, information from social networks, or geospatial services from mobile devices.
● Streaming data needs to be processed sequentially and incrementally in order to, for instance, integrate the different applications that consume the data, or store and process the data to update metrics, reports, and summary statistics in response to each arriving record.
3. Data integration and stream processing
Stream processing is better suited for real-time monitoring and response functions, such as:
● Classifying a banking transaction as fraudulent based on an analytical model, then automatically blocking the transaction
● Sending push notifications to users based on models of their behavior
● Adjusting the parameters of a machine based on the results of real-time analysis of its sensor data
4. Data integration and stream processing
The problem, therefore, is how to build an infrastructure that is:
● Decoupled
● Evolvable
● Operationally transparent
● Resilient to traffic spikes
● Highly available
● Distributed
6. STREAM PROCESSING
A streaming platform has three key capabilities:
● Publish and subscribe to streams of records, similar to a message queue. This allows multiple applications to subscribe to the same or different data sources, which produce data in one or more topics.
● Store streams of records in a fault-tolerant, durable way. This means different clients can access all the events (or a fraction of them) at any time, at their own pace.
● Process streams of records as they occur. This allows filtering, analysing, aggregating, or transforming data.
7. APACHE KAFKA
Kafka is a publish/subscribe messaging system designed to solve the problem of managing continuous data flows. It is used for:
● Building real-time streaming data pipelines that reliably move data between systems or applications
● Building real-time streaming applications that transform or react to streams of data
8. Components & Key concepts
Kafka exposes four different Application Programming Interfaces (APIs), and its architecture is built around a few key components:
1. Producers
2. Consumers
3. Brokers
4. Clustering and Leadership Election
9. PRODUCER
● The Producer API allows an application to publish a stream of records to one or more topics.
● In Kafka, the data records are known as messages, and they are categorized into topics.
● Think of messages as the data records and of a topic as a database table.
10. PRODUCER
A message is made up of two components:
● 1. Key: The key of the message. The key determines the partition the message will be sent to, so careful thought has to be given when deciding on a key for a message. Since a key is mapped to a single partition, an application that pushes millions of messages with one particular key and only a fraction with other keys would result in an uneven distribution of load on the Kafka cluster. If the key is set to null, the producer assigns the message to a partition in a round-robin fashion.
11. PRODUCER
● 2. Body: Each message has a body. Since Kafka provides APIs for multiple programming languages, the message has to be serialized in a way that the subscribers on the other end can understand it. There are encoding protocols the developer can use to serialize a message, such as JSON, Protocol Buffers, Apache Avro, and Apache Thrift, to name a few.
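To make the producer concepts concrete, here is a minimal Java producer sketch; the broker address, topic name, keys, and payloads are illustrative assumptions, not taken from the slides:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        // Serializers turn key and body into bytes (plain strings here; Avro, JSON, etc. also work).
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed message: every record with key "user-42" maps to the same partition.
            producer.send(new ProducerRecord<>("purchases", "user-42", "{\"item\":\"book\"}"));
            // Null key: the producer spreads records across partitions instead.
            producer.send(new ProducerRecord<>("purchases", null, "{\"item\":\"pen\"}"));
        } // close() flushes any buffered records
    }
}
```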
12. CONSUMERS
● The Consumer API, on the other hand, allows an application to subscribe to one or more topics in order to store, process, or react to the stream of records produced to them.
● To accomplish this, Kafka adds to each message a unique integer value. This integer is incremented by one for every message that is appended to the partition.
13. CONSUMERS
● This value is known as the offset of the message. By storing the offset of the last consumed message for each partition, a consumer can stop and restart without losing its place.
● This is what allows Kafka to integrate different types of applications with a single source of data: the data can be processed at a different rate by each consumer.
14. CONSUMERS
● It is also important to know that Kafka allows the existence of consumer groups, which are nothing more than consumers working together to process a topic.
● Consumer groups let the processing of data in Kafka scale out; this is reviewed in the next section.
● Each consumer group identifies itself by a group id. This value has to be unique amongst all consumer groups.
15. CONSUMERS
● In a consumer group, each consumer is assigned to a partition. If there are more consumers than partitions, some consumers will sit idle.
● For instance, if a topic has 5 partitions and the consumer group has 6 consumers, one of those consumers will not get any data.
● On the other hand, if the number of consumers in this case is 3, then Kafka will assign a combination of partitions to each consumer.
16. CONSUMERS
● When a consumer is removed from a consumer group, Kafka reassigns the partitions that belonged to this consumer to other consumers in the group.
● If a consumer is added to a consumer group, Kafka takes some partitions from the existing consumers and assigns them to the new consumer.
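A minimal consumer-group sketch in Java follows; the group id, topic, and broker address are assumptions. Running several copies of this program with the same group.id splits the topic's partitions among them, as described above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "purchases-processor");     // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets explicitly below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("purchases")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // persist our offsets so a restart resumes from here
            }
        }
    }
}
```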
17. BROKERS
● Within a Kafka cluster, a single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.
● It also services consumers, responding to fetch requests for partitions with the messages that have been committed to disk.
● Depending on the specific hardware and its performance characteristics, a single broker can easily handle thousands of partitions and millions of messages per second.
18. CLUSTERING AND LEADERSHIP ELECTION
● A Kafka cluster is made up of brokers. Each broker is allocated a set of partitions.
● A broker can be either the leader or a replica of a partition.
● Each partition also has a replication factor that goes with it.
● For instance, imagine that we have 3 brokers, named b1, b2 and b3, and a producer pushes a message to the Kafka cluster.
19. CLUSTERING AND LEADERSHIP ELECTION
● Using the key of the message, the producer decides which partition to send the message to.
● Let's say the message goes to a partition p1 that resides on broker b1, which is also the leader of this partition; the replication factor is 2, and the replica of partition p1 resides on b2.
20. CLUSTERING AND LEADERSHIP ELECTION
● When the producer sends a message to partition p1, the producer API sends the message to broker b1, which is the leader. If the acks property has been configured to all, the leader b1 waits until the message has been replicated to broker b2. If acks has been set to 1, b1 writes the message to its local log and the request is considered complete.
● In the event of the leader b1 going down, Apache ZooKeeper initiates a leadership election.
● In this particular case, since only broker b2 is available for this partition, it becomes the leader for p1.
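The acks trade-off described above maps directly onto a single producer setting. A small configuration sketch (broker address assumed):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AcksConfig {
    static Properties durabilityProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        // "0"  : do not wait for any acknowledgement from the leader
        // "1"  : the leader has written the record to its local log
        // "all": the leader waits until the in-sync replicas also have the record
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return props;
    }
}
```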
21. CONNECTOR API
● The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems.
● For example, a connector to a relational database might capture every change to a table. If you think of an ETL system, connectors are involved in the extraction and loading of data.
22. STREAMS API
● The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams into output streams.
● Basically, the Streams API helps to organize the data pipeline. Again, thinking of an ETL system, streams are involved in data transformation.
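A minimal Kafka Streams topology in Java, reading one topic, transforming each record, and writing to another; topic names, the application id, and the uppercase transform are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // assumed
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("raw-events"); // input topic (assumed)
        input.mapValues(v -> v.toUpperCase()) // the "transform" step of the pipeline
             .to("clean-events");             // output topic (assumed)

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```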
23. APACHE ZOOKEEPER
● Apache ZooKeeper is a distributed, open-source configuration and synchronization service that provides multiple features for distributed applications.
● Kafka uses it to manage the cluster, storing, for instance, shared information about consumers and brokers.
● One of its most important roles has been to keep track of consumer group offsets (newer Kafka versions store these offsets in an internal Kafka topic instead).
24. WHY KAFKA: Messaging - Publishers/Subscribers
● Read and write streams of data like a messaging system.
● As mentioned before, producers create new messages or events.
○ A message is produced to a specific topic. In the same way, consumers (also known as subscribers) read messages.
○ The consumer subscribes to one or more topics and reads the messages in the order in which they were produced.
○ The consumer keeps track of which messages it has already consumed by keeping track of the offset of messages.
● An important concept to explore: a message queue allows you to scale processing of data over multiple consumer instances that process the data.
25. WHY KAFKA: Messaging - Publishers/Subscribers
● Unfortunately, once a message is consumed from the queue, it is no longer available to other consumers that may be interested in the same message.
● Publish/subscribe, in contrast, allows you to broadcast each message to a list of consumers or subscribers, but by itself it does not scale processing.
● Kafka offers a mix of those two messaging models: Kafka publishes messages in topics that broadcast all the messages to different consumer groups, and each consumer group acts as a message queue that divides up processing over all the members of the group.
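The mixed model can be sketched with nothing more than the group.id setting; the group names below are assumptions. Consumers that share a group id split the partitions (queue semantics), while each distinct group receives the full stream (broadcast semantics):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class GroupSemantics {
    static Properties forGroup(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        return props;
    }

    public static void main(String[] args) {
        // Two separate groups: each receives every message of the topic (broadcast).
        Properties billing = forGroup("billing-service");
        Properties analytics = forGroup("analytics-service");
        // Two consumers created with the SAME group id would instead divide
        // the topic's partitions between them (queue semantics).
    }
}
```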
26. WHY KAFKA: STORE
● Store streams of data safely in a distributed, replicated, fault-tolerant cluster.
● A file system or database commit log is designed to provide a durable record of all transactions so that they can be replayed to consistently build the state of a system.
● Similarly, data within Kafka is stored durably, in order, and can be read deterministically (Narkhede et al. 2017).
27. WHY KAFKA: STORE
● This last point is important: in a message queue, once a message is consumed it disappears; a log, in contrast, allows messages (or events) to be kept for a configurable time period. This is known as retention.
● Thanks to the retention feature, Kafka allows different applications to consume and process messages at their own pace, depending, for instance, on their processing capacity or their business purpose.
28. WHY KAFKA: STORE
● In the same way, if a consumer goes down, it can continue to process the messages in the log once it has recovered. This technique is called event sourcing: whenever we make a change to the state of a system, we record that state change as an event, and we can confidently rebuild the system state by reprocessing the events at any time in the future (Fowler 2017).
● In addition, since the data is distributed within the system, it provides additional protection against failures, as well as significant opportunities for scaling performance (Helland 2015).
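A small replay sketch in Java: rewinding a consumer to the earliest retained offset so the whole event history can be reprocessed, which is the event-sourcing pattern described above (topic and group id are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "state-rebuilder");          // assumed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("account-events")); // assumed topic
            consumer.poll(Duration.ofSeconds(1));            // join the group and get an assignment
            consumer.seekToBeginning(consumer.assignment()); // rewind to the oldest retained event
            // From here, polling replays the retained history in order, letting the
            // application rebuild its state from the events.
        }
    }
}
```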
29. WHY KAFKA: STORE
● Kafka is therefore a kind of special-purpose distributed file system dedicated to high-performance, low-latency commit log storage, replication, and propagation.
● This doesn't mean Kafka's purpose is to replace storage systems, but it proves helpful in keeping data consistent across a distributed set of applications.
30. WHY KAFKA: PROCESS
● Write scalable stream processing applications that react to events in real time.
● As previously described, a stream represents data moving from the producers to the consumers through the brokers. Nevertheless, it is not enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
● Kafka allows building very low-latency pipelines with facilities to transform data as it arrives, including windowed operations, joins, aggregations, etc.
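As a sketch of those windowed operations, here is a Streams aggregation counting events per key over five-minute windows; the topic name and window size are assumptions, and TimeWindows.ofSizeWithNoGrace requires a recent Kafka Streams version:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCount {
    static void buildTopology(StreamsBuilder builder) {
        builder.<String, String>stream("page-views")                      // assumed topic
               .groupByKey()                                              // aggregate per message key
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               .count()                                                   // events per key per window
               .toStream()
               .foreach((windowedKey, count) ->
                       System.out.println(windowedKey + " -> " + count));
    }
}
```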
32. WHY KAFKA: FEATURES - MULTIPLE PRODUCERS
● Kafka is able to seamlessly handle multiple producers, which helps to aggregate data from many data sources in a consistent way.
● A single producer can send messages to one or more topics. Some important properties that can be set at the producer's end are as follows:
○ acks: The number of acknowledgements the producer requires from the leader of the partition after sending a message. If set to 0, the producer will not wait for any reply from the leader. If set to 1, the producer will wait until the leader of the partition writes the message to its local log. If set to all, the producer will wait for the leader to replicate the message to the replicas of the partition.
33. WHY KAFKA: FEATURES - MULTIPLE PRODUCERS
● Some more important properties that can be set at the producer's end are as follows:
○ batch.size: The producer batches messages together up to this size in bytes. This improves performance, as a single TCP connection sends batches of data instead of one message at a time.
○ compression.type: The Producer API provides mechanisms to compress a message before pushing it to the Kafka cluster. The default is no compression, but the API provides gzip, snappy and lz4.
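These two properties can be set on the producer like so; the values are illustrative assumptions, not recommendations from the slides:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerTuning {
    static Properties tuningProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);               // batch up to 32 KB per partition
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");          // or gzip, lz4; default is none
        return props;
    }
}
```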
34. WHY KAFKA: FEATURES - MULTIPLE CONSUMERS
● In the same way, the publish/subscribe architecture allows multiple consumers to process the data.
● Each message is broadcast by the Kafka cluster to all the subscribed consumer groups. This allows multiple applications to connect to the same or different data sources, enabling the business to connect new services as they emerge.
35. WHY KAFKA: FEATURES - MULTIPLE CONSUMERS
● Furthermore, remember that for scaling up processing, Kafka provides consumer groups. Whenever a new consumer group comes online, identified by its group id, it has a choice of reading messages from a topic starting at the first offset or at the latest offset.
● Kafka also provides mechanisms to control the influx of messages to a Kafka consumer.
36. WHY KAFKA: FEATURES - MULTIPLE CONSUMERS
● Some interesting properties that can be set at the consumer's end are discussed below:
○ auto.offset.reset: One can configure the consumer library of choice to start reading messages from the earliest offset or the latest offset. Using earliest is useful if the consumer group is new and wants to process from the very beginning.
○ enable.auto.commit: If set to true, the library commits the offsets periodically in the background, at the interval specified by auto.commit.interval.ms.
○ max.poll.records: The maximum number of records returned in a batch.
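A configuration sketch tying the three consumer properties together; the values and group id are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerTuning {
    static Properties tuningProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reporting");                // assumed group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");   // a new group starts from the beginning
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");      // commit offsets in the background...
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000"); // ...every 5 seconds
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");         // cap records returned per poll()
        return props;
    }
}
```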
37. WHY KAFKA: FEATURES - DISK-BASED RETENTION
● log.retention.bytes: The maximum size a partition's log can grow to before old segments are deleted
● log.retention.hours: The number of hours to keep a log file before deleting it; tertiary to the log.retention.ms property
● log.retention.minutes: The number of minutes to keep a log file before deleting it; secondary to the log.retention.ms property. If not set, the value in log.retention.hours is used
● log.retention.ms: The number of milliseconds to keep a log file before deleting it; if set, it takes precedence over the other two
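The broker-level settings above have per-topic counterparts (retention.ms, retention.bytes). A sketch of creating a topic with explicit retention via the Java AdminClient; the topic name, partition count, and values are assumptions:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("events", 3, (short) 2) // name, partitions, replication factor
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, "604800000",    // keep messages 7 days, or
                            TopicConfig.RETENTION_BYTES_CONFIG, "1073741824" // until a partition reaches 1 GiB
                    ));
            admin.createTopics(Set.of(topic)).all().get(); // block until the topic is created
        }
    }
}
```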
38. WHY KAFKA: FEATURES - HIGH PERFORMANCE
● Kafka's flexible design allows adding multiple brokers so the cluster can scale horizontally.
● Expansions can be performed while the cluster is online, with no impact on the availability of the system as a whole.
● All these features converge to improve Kafka's performance.