Apache Kafka is a distributed streaming platform used at WalmartLabs for various search use cases. It decouples data pipelines and allows real-time data processing. The key concepts include topics to categorize messages, producers that publish messages, brokers that handle distribution, and consumers that process message streams. WalmartLabs leverages features like partitioning for parallelism, replication for fault tolerance, and low-latency streaming.
Apache Kafka is an open-source message broker project, developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Emerging technologies/frameworks in Big Data - Rahul Jain
A short overview presentation on emerging technologies/frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, with basic concepts of columnar storage and Dremel.
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It is also enabling many real-time frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We will also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now supports authentication of users and access control on who can read from and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 - mumrah
Apache Kafka is a new breed of messaging system built for the "big data" world. Coming out of LinkedIn (and donated to Apache), it is a distributed pub/sub system built in Scala. It has been an Apache TLP now for several months with the first Apache release imminent. Built for speed, scalability, and robustness, Kafka should definitely be one of the data tools you consider when designing distributed data-oriented applications.
The talk will cover a general overview of the project and technology, with some use cases, and a demo.
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at least once, at most once, exactly once) with code examples.
Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
Deploying Apache Flume to enable low-latency analytics - DataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, "how can we make the data available to our analytical systems faster?" Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Near-realtime analytics with Kafka and HBase - dave_revell
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
A technical overview of Kafka, including key terminology such as topic, partition, and offset, and Kafka architecture (ZooKeeper, brokers, and replication). Also covers a Kafka and IT team analogy. Read the article @ https://www.linkedin.com/pulse/kafka-technical-overview-sylvester-daniel/
Kafka Needs No Keeper (Jason Gustafson & Colin McCabe, Confluent), Kafka Summi... - confluent
We have been served well by Zookeeper over the years, but it is time for Kafka to stand on its own. This is a talk on the ongoing effort to replace the use of Zookeeper in Kafka: why we want to do it and how it will work. We will discuss the limitations we have found and how Kafka benefits both in terms of stability and scalability by bringing consensus in house. This effort will not be completed overnight, but we will discuss our progress, what work remains, and how contributors can help. (Note that I am proposing this as a joint talk with Colin McCabe, who is also a committer on the Apache Kafka project.)
A step-by-step deep dive into the world of Kafka security. This presentation covers a few of the most sought-after questions in streaming / Kafka, such as what happens internally when SASL / Kerberos / SSL security is configured, and how the various Kafka components interact with each other. It could be a valuable resource for administrators, users, and application developers alike. Internal Kafka knowledge helps them configure, manage, and use Kafka systems more optimally, with the fewest possible errors and mistakes.
The agenda:
- The various Kafka security models available (PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL) and when to use which model
- Anatomy of each security model: an in-depth examination of these models and what happens internally when they are used, with real-life examples
- Do's and don'ts of Kafka security
- Common errors and troubleshooting
This talk is all about looking under the hood with respect to Kafka security. Suitable for all levels, from beginners to experts.
Speaker
Vipin Rathor, Sr. Product Specialist (security), Hortonworks
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications with the help of messages - producers and consumers of information.
Apache Kafka - Scalable Message-Processing and More! - Guido Schmutz
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly-scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data / information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka in the Oracle stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Developing Realtime Data Pipelines With Apache Kafka - Joe Stein
Developing Realtime Data Pipelines With Apache Kafka. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Presentation from the Kafka meetup on 13-SEP-2013, including some notes to clarify some slides. Enjoy!
Avi Levi
123avi@gmail.com
https://www.linkedin.com/in/leviavi/
How to use Kafka for storing intermediate data and as a pub/sub model, with an in-depth look at producer, consumer, and topic configs, and at how Kafka works internally.
In the following slides, we explore Kafka and Event-Driven Architecture. We define what the Kafka platform is, how it works, and analyze Kafka APIs such as the Consumer API, Producer API, and Streams API. We also take a look at some core Kafka configuration before deploying to production, and discuss a few best approaches for building a reliable data delivery system with Kafka.
Check out our repository: https://github.com/arconsis/Eshop-EDA
In the following slides, our dear colleagues Dimosthenis Botsaris and Alexandros Koufatzis explore Kafka and Event-Driven Architecture. They define what the Kafka platform is, how it works, and analyze Kafka APIs such as the Consumer API, Producer API, and Streams API. They also take a look at some core Kafka configuration before deploying to production, and discuss a few best approaches for building a reliable data delivery system with Kafka.
Check out the repository: https://github.com/arconsis/Eshop-EDA
The purpose of the session is to dive into Apache Kafka, data streaming, and Kafka in the cloud:
- Dive into Apache Kafka
- Data streaming
- Kafka in the cloud
Apache Kafka - Scalable Message Processing and More! - Guido Schmutz
In the world of sensors and social media streams, the integration and handling of high-volume event streams is more important than ever. Events have to be handled both efficiently and reliably, and often many consumers or systems are interested in all or part of the events. How do we make sure that all these events are accepted and forwarded in an efficient and reliable way? Apache Kafka, a distributed, highly-scalable messaging broker built for exchanging huge amounts of messages between a source and a target, can be of great help in such a scenario.
This session introduces Apache Kafka and its place in a modern architecture, shows its integration with Oracle Stack and presents the Oracle Event Hub cloud service, the managed Kafka service.
In this session you will learn:
1. Kafka Overview
2. Need for Kafka
3. Kafka Architecture
4. Kafka Components
5. ZooKeeper Overview
6. Leader Node
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Grokking TechTalk #24: Kafka's principles and protocols - Grokking VN
The talk introduces Kafka and digs deep into Kafka's principles and the design choices that make Kafka fast, scalable, and highly reliable. It also covers how Kafka servers interact with Kafka clients.
The talk dives into Kafka's internals and analyzes why its design decisions were made the way they were. It is suitable for software engineers who are, or want to get, familiar with the various job queues and message queues out there.
Speaker: Nguyen Quang Minh
- Software Engineer, Technical Lead @ Employment Hero
- Contributor to `ruby-kafka` (the most popular Kafka client for Ruby)
Apache Kafka Women Who Code Meetup
1. APACHE KAFKA & USE CASES WITHIN SEARCH SYSTEM @WALMARTLABS
- Snehal Nagmote @WalmartLabs
- WOMEN WHO CODE SUNNYVALE MEETUP
2. Today's Agenda
Apache Kafka Intro
Kafka Core Concepts
Kafka Producer - In Detail
Kafka Consumer - In Detail
Kafka Ecosystem
Kafka Streams and Kafka Connect
Search Use Cases @WalmartLabs
3. Kafka decouples Data Pipelines
What is Apache Kafka?
[Diagram: multiple source systems feed data through producers into Kafka brokers; consumers deliver the streams to Hadoop, security systems, real-time monitoring, and a data warehouse.]
4. Key terminology
Kafka maintains feeds of messages in categories called topics.
Processes that publish messages to a Kafka topic are called producers.
Processes that subscribe to topics and process the feed of published messages are called consumers.
Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
Communication between all components is done via a simple, high-performance binary API over TCP.
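To make these terms concrete, here is a minimal end-to-end sketch with the Java client: one producer publishing a message to a topic on a broker, and one consumer subscribing and reading it back. The broker address (localhost:9092), topic name (test-topic), and group id (demo-group) are illustrative assumptions, not values from the deck.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaHelloWorld {
    public static void main(String[] args) {
        // Producer: publishes a message to the "test-topic" topic on the broker.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<String, String>("test-topic", "hello"));
        }

        // Consumer: subscribes to the topic and processes the feed of messages.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");                // placeholder consumer group
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records)
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
        }
    }
}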
7. Why is Kafka so fast?
Kafka achieves its high throughput and low latency primarily through two key concepts:
1) Batching of individual messages, to amortize network overhead and append/consume chunks together.
2) Zero-copy I/O using sendfile (Java's NIO FileChannel.transferTo method), which invokes the Linux sendfile() system call and skips unnecessary copies.
Kafka also relies heavily on the Linux page cache:
The I/O scheduler will batch consecutive small writes into bigger physical writes, which improves throughput.
The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head, which improves throughput.
The page cache automatically uses all the free memory on the machine.
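As a rough illustration of the zero-copy idea, a sketch of FileChannel.transferTo: the bytes of a file flow from the file channel to a socket channel inside the kernel (sendfile on Linux), skipping the usual read-into-user-buffer / write-from-user-buffer copies. The file name and destination address are placeholders; this is not Kafka's actual code.

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        // Placeholder log segment and destination; illustrative only.
        try (FileChannel file = new FileInputStream("segment.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo maps to sendfile(2) on Linux: bytes flow kernel-to-kernel.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}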
10. Overview of Part 2: Kafka core concepts
A first look
Topics, partitions, replicas, offsets
Producers, brokers, consumers
Putting it all together
11. A first look
The who is who:
Producers write data to brokers.
Consumers read data from brokers.
All this is distributed.
The data:
Data is stored in topics.
Topics are split into partitions, which are replicated.
12. Topics
Topic: feed name to which messages are published.
[Diagram: producers A1...An always append to the "tail" of a topic on the broker(s), so older messages sit behind newer ones; consumer groups C1 and C2 each use an "offset pointer" to track/control their read progress (and decide the pace of consumption).]
Retention policy: time-based or size-based.
Configs: retention.ms (time-based) or retention.bytes (size-based).
13. Creation of Topic
Creating a topic
CLI:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic test-topic
Auto-create via auto.create.topics.enable = true
Modifying a topic
https://kafka.apache.org/documentation.html#basic_ops_modify_topic
Delete a topic
14. Partitions
A topic consists of partitions.
Partition: an ordered and immutable sequence of messages.
15. Partitions
The number of partitions of a topic is configurable.
The number of partitions determines the maximum parallelism of a consumer group.
[Diagram: consumer group A, with 2 consumers, reads from a 4-partition topic; consumer group B, with 4 consumers, reads from the same topic.]
16. Partition offsets
Offset: messages in the partitions are each assigned a unique (per-partition) and sequential id called the offset.
Consumers track their pointers via (offset, partition, topic) tuples.
[Diagram: consumer group C1 reading a partition at its tracked offsets.]
17. Replicas of a partition
Replicas: "backups" of a partition.
They exist solely to prevent data loss.
Replicas are never read from and never written to; they do NOT help to increase producer or consumer parallelism!
Kafka tolerates (numReplicas - 1) dead brokers before losing data.
Example: numReplicas == 2 means 1 broker can die.
18. Topics vs. Partitions vs. Replicas
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
19. Inspecting the current state of a topic
--describe the topic:
$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe --topic test-topic
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: test-topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
Leader: the broker ID of the currently elected leader broker.
Replica IDs = broker IDs.
ISR = "in-sync replicas", replicas that are in sync with the leader.
In this example:
Broker 0 is leader for partition 1.
Broker 1 is leader for partitions 0 and 2.
All replicas are in sync with their respective leader partitions.
20. Let's recap
The who is who:
Producers write data to brokers.
Consumers read data from brokers.
All this is distributed.
The data:
Data is stored in topics.
Topics are split into partitions, which are replicated.
22. Producers - Writing data to Kafka
Producers publish to a topic of their choosing (push).
Load can be distributed:
Typically by "round-robin".
Can also do "semantic partitioning" based on a key in the message.
Brokers load balance by partition.
Async (less durable) sending is supported.
All nodes can answer metadata requests about:
Which servers are alive.
Where the leaders are for the partitions of a topic.
23. Producer - 0.9
All calls are non-blocking and async.
Two options for checking for failures:
Immediately block for the response: send().get()
Do follow-up work in a Callback, and close the producer after an error threshold. Be careful about buffering these failures. Future work? KAFKA-1955
Don't forget to close the producer! producer.close() will block until in-flight sends complete.
retries (producer config) defaults to 0.
message.send.max.retries (old producer config) defaults to 3.
In-flight requests could lead to message reordering.
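A minimal sketch of the two failure-checking options above with the 0.9+ Java producer; the broker address and topic are placeholder assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendPatterns {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Option 1: block immediately for the response; get() throws if the send failed.
        producer.send(new ProducerRecord<String, String>("test-topic", "sync")).get();

        // Option 2: non-blocking send; do follow-up work in the callback.
        producer.send(new ProducerRecord<String, String>("test-topic", "async"), new Callback() {
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception != null) {
                    // count the failure here; close the producer after an error threshold
                    exception.printStackTrace();
                }
            }
        });

        // Don't forget: close() blocks until in-flight sends complete.
        producer.close();
    }
}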
24. Understand the Kafka producer
User: producer.send(new ProducerRecord("topic0", "hello"), callback);
Serialization, partitioning, and compression are tasks done by the user threads.
[Diagram: the serializer and partitioner turn (topic="topic0", value="hello") into (topic="topic0", partition=0, value="hello"); the record accumulator holds one batch queue per partition (topic0-0, topic0-1, ...), a compressor, and the pending callbacks.]
25. Understand the Kafka producer
Sender:
1. polls batches from the batch queues (one batch per partition)
2. groups batches based on the leader broker
3. sends the grouped batches to the brokers
4. pipelines if max.in.flight.requests.per.connection > 1
[Diagram: the sender thread drains the record accumulator's per-partition batch queues (topic0-0 ... topic0-M) and groups batches into request 0 for broker 0 and request 1 for broker 1.]
26. Producer APIs - (0.9)
/**
 * Send the given record asynchronously and return a future which will eventually contain the response information.
 * @return A future which will eventually contain the response information
 */
public Future<RecordMetadata> send(ProducerRecord<K, V> record);

/**
 * Send a record and invoke the given callback when the record has been acknowledged by the server
 */
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback);
27. Producer - Message Reordering
Message reordering might happen if:
- max.in.flight.requests.per.connection > 1, AND
- retries are enabled
To prevent reordering:
- max.in.flight.requests.per.connection=1
- close the producer in the callback with close(0) on send failure
[Diagram: the producer sends message 0 and message 1; message 0 fails and is retried after message 1, so the broker's log ends up with message 1 before message 0.]
28. Producer - Configs
batch.size - size-based batching
linger.ms - time-based batching
More batching -> better compression -> higher throughput -> higher latency
compression.type
max.in.flight.requests.per.connection (affects ordering)
acks (affects durability)
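A sketch of how these knobs could be set on the Java producer; the concrete values are illustrative assumptions, not recommendations from the deck:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerTuning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Size-based batching: send a batch once it reaches 64 KB...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        // ...or time-based batching: wait up to 10 ms for more records to arrive.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Bigger batches compress better: higher throughput, higher latency.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        // Affects ordering (see slide 27): 1 preserves order under retries.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        // Affects durability: wait for all in-sync replicas to acknowledge.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}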
29. Producers - Data Loss
Caveats - the producer can cause data loss when:
- block.on.buffer.full=false
- retries are exhausted
- sending messages without acks=all
Solutions:
- block.on.buffer.full=true, so the producer does not throw away messages
- retries=Integer.MAX_VALUE
- acks=all
- resend in the callback when a message send has failed
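One way the "resend in callback" advice might look; this naive sketch simply resends the same record on failure and is only an illustration (a real implementation would cap attempts and mind reordering):

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ResendingCallback implements Callback {
    private final KafkaProducer<String, String> producer;
    private final ProducerRecord<String, String> record;

    ResendingCallback(KafkaProducer<String, String> producer,
                      ProducerRecord<String, String> record) {
        this.producer = producer;
        this.record = record;
    }

    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            // Naive policy: resend the same record with the same callback.
            producer.send(record, this);
        }
    }
}

Usage would be producer.send(record, new ResendingCallback(producer, record));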
30. Producers - Ack Settings
Durability can be configured with the producer configuration request.required.acks:
0: the message is written to the network (buffer)
1: the message is written to the leader
all: the producer gets an ack after all ISRs receive the data; the message is committed

acks | Throughput | Latency | Durability
-----|------------|---------|-------------
0    | high       | low     | no guarantee
1    | medium     | medium  | leader
-1   | low        | high    | ISR
31. Write operations behind the scenes
When writing to a topic in Kafka, producers write directly to the partition leaders (brokers) of that topic.
Remember: writes always go to the leader replica of a partition!
This raises two questions:
How to know the "right" partition for a given topic?
How to know the current leader broker/replica of a partition?
32. 1) How to know the "right" partition when sending?
In Kafka, the producer, i.e. the client, decides to which target partition a message will be sent.
Can be random ~ load balancing across the receiving brokers.
Can be semantic, based on a message "key", e.g. by user ID or domain name.
Here, Kafka guarantees that all data for the same key will go to the same partition, so consumers can make locality assumptions.
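A small sketch of semantic partitioning with the Java producer: records without a key are spread across partitions, while records sharing a key always land in the same partition. The topic name, key, and broker address are placeholder assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the producer spreads records across partitions (load balancing).
            producer.send(new ProducerRecord<String, String>("clicks", "anonymous view"));
            // With a key: every record for user-42 lands in the same partition.
            RecordMetadata m1 = producer.send(
                new ProducerRecord<String, String>("clicks", "user-42", "view")).get();
            RecordMetadata m2 = producer.send(
                new ProducerRecord<String, String>("clicks", "user-42", "purchase")).get();
            System.out.println(m1.partition() == m2.partition()); // true
        }
    }
}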
33. 2) How to know the current leader of a partition?
Producers: broker discovery, aka bootstrapping.
Producers don't talk to ZooKeeper, so it's not through ZK.
Broker discovery is achieved by providing producers with a "bootstrapping" broker list, cf. bootstrap.servers.
These brokers inform the producer about all alive brokers and where to find the current partition leaders. The bootstrap brokers do use ZK for that.
Impacts on failure handling:
In Kafka 0.8 the bootstrap list is static/immutable during the producer's run-time.
The bootstrap approach has improved in Kafka 0.9. This change will make the life of Ops easier.
34. Consumer - Reading data
Multiple consumers can read from the same topic.
Each consumer is responsible for managing its own offset.
Messages stay on Kafka... they are not removed after they are consumed.
[Diagram: a producer writes to a partition at offsets 1234567-1234577 while three consumers fetch from it independently, each tracking its own offset (e.g. 1234577).]
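A sketch of what "each consumer manages its own offset" allows: because consuming does not remove messages, a consumer can be assigned a partition explicitly and seek back to an earlier offset to re-read it. The topic, partition, and offset here are placeholder assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "replay-demo");             // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Explicitly take partition 0 of the topic and rewind to offset 0:
            // the messages are still there, since consuming does not remove them.
            TopicPartition tp = new TopicPartition("test-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 0L);
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> r : records)
                System.out.printf("re-read offset %d: %s%n", r.offset(), r.value());
        }
    }
}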
35. Consumer
Consumers can go away...
[Diagram: one consumer disconnects; the producer keeps writing at offsets 1234567-1234577 and the remaining consumers keep fetching at their own offsets.]
36. Consumer
...and then come back.
[Diagram: the consumer reconnects and resumes fetching from its stored offset; the messages are still on the broker.]
37. Consumer - Groups
Consumers can be organized into consumer groups.
Common patterns (see the sketch below):
1) All consumer instances in one group: acts like a traditional queue with load balancing.
2) All consumer instances in different groups: all messages are broadcast to all consumer instances.
3) "Logical subscriber" - many consumer instances in a group: consumers are added for scalability and fault tolerance; each consumer instance reads from one or more partitions of a topic; there cannot be more consumer instances than partitions.
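A hedged sketch of how the group.id setting selects between patterns 1 and 2 with the Java consumer; the group names, topic, and broker address are illustrative assumptions:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupPatterns {
    // Builds a consumer in the given group; partitions of "orders" are divided
    // among all consumers sharing a group.id (queue semantics), while consumers
    // with different group.ids each see every message (broadcast semantics).
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("orders"));
        return consumer;
    }

    public static void main(String[] args) {
        // Pattern 1: two instances in ONE group split the partitions between them.
        KafkaConsumer<String, String> a1 = consumerInGroup("billing");
        KafkaConsumer<String, String> a2 = consumerInGroup("billing");
        // Pattern 2: a consumer in a DIFFERENT group gets its own full copy of the feed.
        KafkaConsumer<String, String> b = consumerInGroup("audit");
        a1.close(); a2.close(); b.close();
    }
}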
38. Consumer - Groups
Consumer groups provide isolation to topics and partitions.
[Diagram: a Kafka cluster with brokers 1 and 2 holding partitions P0-P3; consumer group A (C1, C2) and consumer group B (C3-C6) each read all partitions independently.]
39. Consumer - Group Rebalancing
Consumer groups can rebalance themselves.
[Diagram: consumer C2 in group A fails (marked X); the group rebalances and its partitions are redistributed among the remaining consumers.]
42. Delivery Semantics
At least once: messages are never lost but may be redelivered. (Default)
At most once: messages may be lost but are never redelivered.
Exactly once: messages are delivered once and only once.
43. Delivery Semantics
At least once: messages are never lost but may be redelivered.
At most once: messages may be lost but are never redelivered.
Exactly once: messages are delivered once and only once. Much harder (impossible??)
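The first two semantics are commonly obtained by ordering offset commits relative to processing; a sketch with the Java consumer, assuming auto-commit is disabled (the topic, group id, and process() helper are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemantics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "semantics-demo");          // placeholder
        props.put("enable.auto.commit", "false");         // we commit manually
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);

                // At most once: commit BEFORE processing. A crash mid-batch means
                // the messages are never redelivered (their effects are lost).
                // consumer.commitSync();
                // for (ConsumerRecord<String, String> r : records) process(r);

                // At least once: process BEFORE committing. A crash before the
                // commit means the whole batch is redelivered (possible duplicates).
                for (ConsumerRecord<String, String> r : records) process(r);
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) {
        System.out.println(r.value()); // placeholder processing
    }
}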
44. Kafka + X for processing the data?
Kafka + Spark Streaming: near-real-time indexing in Search (WalmartLabs)
Kafka Streams: a library for stream processing
Kafka + Storm/Heron: often used in combination, e.g. at Twitter
Samza (since Aug '13): also by LinkedIn
"Normal" Java multi-threaded setups
Akka actors with Scala or Java, e.g. Ooyala
Kafka + Camus/Gobblin for Kafka->Hadoop ingestion
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
45. Kafka Ecosystem
Kafka Connect: used for writing sources and sinks that either continuously ingest data into Kafka or continuously ingest data from Kafka into external systems.
Kafka Streams: the built-in stream processing library of the Apache Kafka project.
Confluent Control Center: Confluent's management and monitoring tool for Kafka.
Kafka Manager: a tool for managing Apache Kafka.
Kafka MirrorMaker: a tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster.
HiveKa: Hive queries on Kafka topics, http://hiveka.weebly.com/
Src: https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
46. Search Use Cases @WalmartLabs
Filter "Pick Up Today": a search filter that allows you to buy online and pick up from a store
Easily scales up to 15-20k events/sec
Near-real-time indexing pipeline
Uses 9 Kafka topics, ~100s of GBs of data
NRT Walmart Sponsored Ads
Application log processing
Hubble logs data: collecting user activity data
During holidays, scales up to almost 40-50k events/sec
47. Pick Up Today - Challenges
Processing:
- 200 million events/day
- 5k events/sec on average
- 12k events/sec at peak
Storage:
- 1 item -> 5,000 stores (5 million * 5,000)
- Fast retrieval