SlideShare a Scribd company logo
APACHE KAFKA & USECASES WITHIN
SEARCH SYSTEM @WALMARTLABS
- Snehal Nagmote @WalmartLabs
-WOMEN WHO CODE SUNNYVALE MEETUP
Todays Agenda
 Apache Kafka Intro
 Kafka Core Concepts
 Kafka Producer - In Detail
 Kafka Consumer - In Detail
 Kafka Ecosystem
 Kafka Streams and Kafka Connect
 Search Use Cases @WalmartLabs
Kafka decouples Data Pipelines
What is Apache Kafka?
3
Source
System
Source
System
Source
System
Source
System
Hadoop
Security
Systems
Real-time
monitoring
Data
Warehouse
Kafka
Producers
Brokers
Consumers
Key terminology
 Kafka maintains feeds of messages in categories called
topics.
 Processes that publish messages to a Kafka topic are
called producers.
 Processes that subscribe to topics and process the feed of
published messages are called consumers.
 Kafka is run as a cluster comprised of one or more servers
each of which is called a broker.
 Communication between all components is done via a high
performance simple binary API over TCP protocol
Architecture
5
Producer
Consumer Consumer
Producers
Kafka Cluster
Consumers
Broker Broker Broker Broker
Producer
Zookeeper
KafkaVs
src: https://softwaremill.com/mqperf/
Why Kafka is so Fast ?
 Kafka achieves it’s high throughput and low latency primarily from two key
concepts
 1) Batching of individual messages to amortize network overhead and
append/consume chunks together
 2) Zero copy I/O using sendfile (Java’s NIO FileChannel transferTo method).
 Implements linux sendfile() system call which skips unnecessary copies
 Heavily relies on Linux PageCache
 The I/O scheduler will batch together consecutive small writes into
bigger physical writes which improves throughput.
 The I/O scheduler will attempt to re-sequence writes to minimize
movement of the disk head which improves throughput.
 It automatically uses all the free memory on the machine
Zero Copy
Part 2: Kafka core concepts
9
Overview of Part 2: Kafka core
concepts
 A first look
 Topics, partitions, replicas, offsets
 Producers, brokers, consumers
 Putting it all together
10
A first look
 The who is who
 Producers write data to brokers.
 Consumers read data from brokers.
 All this is distributed.
 The data
 Data is stored in topics.
 Topics are split into partitions, which are replicated.
11
Topics
12
Broker(s)
new
Producer A1
Producer A2
Producer An
…
Producers always append to “tail”
…
Older msgs Newer msgs
Consumer group C1
Consumers use an “offset pointer” to
track/control their read progress
(and decide the pace of consumption)
Consumer group C2
Topic: feed name to which messages are published
Retention Policy : time Based/size based
Configs : retention.ms (Time based) or size based (retention.bytes)
Creation of Topic
 Creating a topic
 CLI
 Auto-create via auto.create.topics.enable = true
 Modifying a topic
 https://kafka.apache.org/documentation.html#basic_ops_modify_topic
 Delete a topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-
factor 2 –partitions 3 –topic test-topic
Partitions
14
 A topic consists of partitions.
 Partition: ordered + immutable sequence of
messages
Partitions
15
 Partitions of a topic is configurable
 Partitions determines max consumer (group)
parallelism
 Consumer group A, with 2 consumers, reads from a 4-partition topic
 Consumer group B, with 4 consumers, reads from the same topic
Partition offsets
16
Offset: messages in the partitions are each assigned a
unique (per partition) and sequential id called the
offset
 Consumers track their pointers via (offset, partition,
topic) tuples
Consumer group C1
Replicas of a partition
17
Replicas: “backups” of a partition
 They exist solely to prevent data loss.
 Replicas are never read from, never written to.
 They do NOT help to increase producer or consumer
parallelism!
 Kafka tolerates (numReplicas - 1) dead brokers before
losing data
 numReplicas == 2  1 broker can die
Topics vs. Partitions vs. Replicas
19
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Inspecting the current state of a topic
 --describe the topic
 Leader: brokerID of the currently elected leader broker
 Replica ID’s = broker ID’s
 ISR = “in-sync replica”, replicas that are in sync with the leader
 In this example:
 Broker 0 is leader for partition 1.
 Broker 1 is leader for partitions 0 and 2.
 All replicas are in-sync with their respective leader partitions.
19
$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe –topic test-topic
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: test-topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
Let’s recap
 The who is who
 Producers write data to brokers.
 Consumers read data from brokers.
 All this is distributed.
 The data
 Data is stored in topics.
 Topics are split into partitions which are replicated.
20
Kafka Producers
21
Producers -Writing data to Kafka
 Producers publish to a topic of their choosing (push)
 Load can be distributed
 Typically by “round-robin”
 Can also do “semantic partitioning” based on a key in the
message
 Brokers load balance by partition
 Support async (less durable) sending
 All nodes can answer metadata requests about:
 Which servers are alive
 Where leaders are for the partitions of a topic
Producer - 0.9
 All calls are non-blocking async
 2 Options for checking for failures:
 Immediately block for response: send().get()
 Do followup work in Callback, close producer after error threshold
 Be careful about buffering these failures. Future work? KAFKA-
1955
 Don’t forget to close the producer! producer.close() will block
until in-flight txns complete
 retries (producer config) defaults to 0
 message.send.max.retries (server config) defaults to 3
 In flight requests could lead to message re-ordering
Understand the Kafka producer
Record Accumulator
batch
0
batch
1
topic0, 0
● Serialization
● Partitioning
● Compression
Tasks done by the user threads.
compressor
used M free
callback
s
CB
Topic
Metadata
topic=“topic0”
value=“hello” PartitionerSerializer
topic=“topic0”
partition =0
value=“hello”
2
4
User: producer.send(new ProducerRecord(“topic0”, “hello”), callback);
Sender:
1. polls batches from the batch queues (one batch / partition)
2. groups batches based on the leader broker
3. sends the grouped batches to the brokers
4. Pipelining if max.in.flight.requests.per.connection > 1
Understand the Kafka producer
batch1 topic0, 0
topic0, 1
batch
1
batch
2
Broker 0
request 0
batch0 batch0
(one batch / partition)
compressor
used M free
…...
request 1
batch0
Broker 1 topic0, M
sender thread Record Accumulator
2
5
callback
s
CB
Producer Apis – (0.9)
/**
* Send the given record asynchronously and return a future which will eventually
contain the response information.
* @return A future which will eventually contain the response information
*/
public Future send(ProducerRecord record);
/**
* Send a record and invoke the given callback when the record has been
acknowledged by the server
*/
public Future send(ProducerRecord record, Callback callback);
}
Producer – Message Reordering
Message reordering might happen if:
● max.in.flight.requests.per.connection > 1, AND
● retries are enabled
To Prevent Reordering
● max.in.flight.requests.per.connection=1
● close producer in callback with close(0) on send failure
Kafka
Broker
Producer
message 0
Message 1
Retry message0
Message 0
Producer – Configs
 batch.size – size based batching
 linger.ms – time based batching
 More batching-> better compression->higher
throughput->higher latency
 compression.type
 max.in.flight.requests.per.connection (affects
ordering)
 acks (affects durability)
Producers – Data Loss
 Caveats - Producer can cause data loss when
 block.on.buffer.full=false
 retries are exhausted
 sending message without using acks=all
 Solutions
 block.on.buffer.full=TRUE – make sure producer does not
throw messages
 retries=Long.MAX_VALUE
 acks=all
 resend in callback when message send failed
29
Producers – Ack Settings
3
0
 Durability can be configured with the producer configuration
request.required.acks
 0 The message is written to the network (buffer)
 1 The message is written to the leader
 all The producer gets an ack after all ISRs receive the data; the
message is committed
.
acks Throughput Latency Durability
0 high low No guarantee
1 medium medium leader
-1 low high ISR
Write operations behind the scenes
 When writing to a topic in Kafka, producers write directly to the partition
leaders (brokers) of that topic
 Remember: Writes always go to the leader ISR of a partition!
 This raises two questions:
 How to know the “right” partition for a given topic?
 How to know the current leader broker/replica of a partition?
31
 In Kafka, a producer – i.e. the client – decides to which target partition a
message will be sent.
 Can be random ~ load balancing across receiving brokers.
 Can be semantic based on message “key”, e.g. by user ID or domain name.
 Here, Kafka guarantees that all data for the same key will go to the same partition, so
consumers can make locality assumptions.
1) How to know the “right” partition
when sending?32
2) How to know the current leader of a
partition?
 Producers: broker discovery aka bootstrapping
 Producers don’t talk to ZooKeeper, so it’s not through ZK.
 Broker discovery is achieved by providing producers with a “bootstrapping”
broker list, cf. bootstrap.servers
 These brokers inform the producer about all alive brokers and where to find current partition leaders. The
bootstrap brokers do use ZK for that.
 Impacts on failure handling
 In Kafka 0.8 the bootstrap list is static/immutable during producer run-time.
 The current bootstrap approach has improved in Kafka 0.9. This change will make
the life of Ops easier.
33
Consumer – Reading data
 Multiple Consumers can read from the same topic
 Each Consumer is responsible for managing it’s own offset
 Messages stay on Kafka…they are not removed after
they are consumed
1234567
1234568
1234569
1234570
1234571
1234572
1234573
1234574
1234575
1234576
1234577
Consumer
Producer
Consumer
Consumer
1234577
Send
Writ
e
Fetc
h
Fetc
h
Fetc
h
Consumer
 Consumers can go away
1234567
1234568
1234569
1234570
1234571
1234572
1234573
1234574
1234575
1234576
1234577
Consumer
Producer
Consumer
1234577
Send
Writ
e
Fetc
h
Fetc
h
Consumer
 And then come back
1234567
1234568
1234569
1234570
1234571
1234572
1234573
1234574
1234575
1234576
1234577
Consumer
Producer
Consumer
Consumer
1234577
Send
Writ
e
Fetc
h
Fetc
h
Fetc
h
Consumer - Groups
 Consumers can be organized into Consumer Groups
Common Patterns:
 1) All consumer instances in one group
 Acts like a traditional queue with load balancing
 2) All consumer instances in different groups
 All messages are broadcast to all consumer instances
 3) “Logical Subscriber” – Many consumer instances in a
group
 Consumers are added for scalability and fault tolerance
 Each consumer instance reads from one or more partitions for a
topic
 There cannot be more consumer instances than partitions
Consumer - Groups
P0 P3 P1 P2
C1 C2 C3 C4 C5 C6
Kafka
ClusterBroker 1 Broker 2
Consumer
Group A
Consumer Group B
Consumer
Groups provide
isolation to
topics and
partitions
Consumer – Group Rebalancing
Can rebalance
themselves
P0 P3 P1 P2
C1 C2 C3 C4 C5 C6
Kafka Cluster
Broker 1 Broker 2
Consumer Group A Consumer Group B
X
Rebalancing – In detail
New consumer Java API in 0.9.0
41
Configure
Subscribe
Process
Delivery Semantics
 At least once
 Messages are never lost but may be redelivered
 At most once
 Messages are lost but never redelivered
 Exactly once
 Messages are delivered once and only once
Default
Delivery Semantics
 At least once
 Messages are never lost but may be redelivered
 At most once
 Messages are lost but never redelivered
 Exactly once
 Messages are delivered once and only once
Much Harder
(Impossible??)
Kafka + X for processing the data?
 Kafka + Spark Streaming
 Near real time indexing in Search (WalmartLabs)
 Kafka Streams – Library for Stream Processing
 Kafka + Storm/Heron often used in combination, e.g. Twitter
 Samza (since Aug ’13) – also by LinkedIn
 “Normal” Java multi-threaded setups
 Akka actors with Scala or Java, e.g. Ooyala
 Kafka + Camus/goblin for Kafka->Hadoop ingestion
44
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Kafka Ecosystem
 Kafka Connect - Used for writing sources and sinks that either
continuously ingest data into Kafka or continuously ingest data from
Kafka into external systems.
 Kafka Streams - Built-in stream processing library of the Apache
Kafka project
 Confluent Control Center -
 Kafka Manager - A tool for managing Apache Kafka.
 Kafka Mirrormaker – A tool to mirror a source Kafka cluster into a
target (mirror) Kafka cluster.
 HiveKa – Hive queries on Kafka Topic http://hiveka.weebly.com/
Src : https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Search Usecases @WalmartLabs
 Filter Pick up Today – Search filter allows you to
buy online and pick up from store
 Easily scales upto 15-20k events/sec
 Near real time indexing pipeline
 Uses 9 Kafka topics ~100’s gbs of data
 NRT Walmart Sponsored Ads
 Application Log Processing
 Hubble logs data – Collecting User Activity Data
 During holiday – almost scales upto 40-50k events/sec
Pick Up Today- Challenges
 Processing
• 200 Million events/day
• 5k events/sec on avg
• 12k events/sec at peak
 Storage
• 1 item ->5000 stores (5 million*5000)
• Fast retrieval
Pick Up Today – High Level Overview
Implementation Considerations
 Reprocessing
 Delivery Semantics
 Recovery from Streaming Failures
 Monitoring
 Consumer Lag Monitoring
 Apache Burrow
 Consumer Offset Checker
Streaming Component Interaction
Questions ?

More Related Content

What's hot

Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
Joan Viladrosa Riera
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_final
Ramya Sunil
 
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
kawamuray
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
Ayyappadas Ravindran (Appu)
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Kafka Technical Overview
Kafka Technical OverviewKafka Technical Overview
Kafka Technical Overview
Sylvester John
 
Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...
Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...
Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...
confluent
 
Kafka Connect
Kafka ConnectKafka Connect
Kafka Connect
Oleg Kuznetsov
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Shravan (Sean) Pabba
 
Visualizing Kafka Security
Visualizing Kafka SecurityVisualizing Kafka Security
Visualizing Kafka Security
DataWorks Summit
 
Apache Kafka® Security Overview
Apache Kafka® Security OverviewApache Kafka® Security Overview
Apache Kafka® Security Overview
confluent
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System Overview
Dmitry Tolpeko
 

What's hot (20)

Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_final
 
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINEKafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
Kafka Multi-Tenancy - 160 Billion Daily Messages on One Shared Cluster at LINE
 
Nov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on HadoopNov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on Hadoop
 
Kafka basics
Kafka basicsKafka basics
Kafka basics
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Kafka Technical Overview
Kafka Technical OverviewKafka Technical Overview
Kafka Technical Overview
 
Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...
Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...
Kafka Needs no Keeper( Jason Gustafson & Colin McCabe, Confluent) Kafka Summi...
 
Kafka Connect
Kafka ConnectKafka Connect
Kafka Connect
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Visualizing Kafka Security
Visualizing Kafka SecurityVisualizing Kafka Security
Visualizing Kafka Security
 
Apache Kafka® Security Overview
Apache Kafka® Security OverviewApache Kafka® Security Overview
Apache Kafka® Security Overview
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System Overview
 

Similar to Apache Kafka Women Who Code Meetup

Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Guido Schmutz
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
Guido Schmutz
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
Joe Stein
 
Kafka zero to hero
Kafka zero to heroKafka zero to hero
Kafka zero to hero
Avi Levi
 
Apache Kafka - From zero to hero
Apache Kafka - From zero to heroApache Kafka - From zero to hero
Apache Kafka - From zero to hero
Apache Kafka TLV
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
Aparna Pillai
 
Kafka Deep Dive
Kafka Deep DiveKafka Deep Dive
Kafka Deep Dive
Knoldus Inc.
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
Dimosthenis Botsaris
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
arconsis
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
confluent
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
Guido Schmutz
 
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson LearnedApache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
Guozhang Wang
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and Zookeeper
AnandMHadoop
 
Kafka 10000 feet view
Kafka 10000 feet viewKafka 10000 feet view
Kafka 10000 feet view
younessx01
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
Grokking VN
 
Kafka overview
Kafka overviewKafka overview
Kafka overview
Shanki Singh Gandhi
 
Proof of Concept on Kafka.pptx
Proof of Concept on Kafka.pptxProof of Concept on Kafka.pptx
Proof of Concept on Kafka.pptx
ssuser92147e
 
04-Kafka.pptx
04-Kafka.pptx04-Kafka.pptx
04-Kafka.pptx
MannMehta13
 

Similar to Apache Kafka Women Who Code Meetup (20)

Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
 
Kafka zero to hero
Kafka zero to heroKafka zero to hero
Kafka zero to hero
 
Apache Kafka - From zero to hero
Apache Kafka - From zero to heroApache Kafka - From zero to hero
Apache Kafka - From zero to hero
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Kafka Deep Dive
Kafka Deep DiveKafka Deep Dive
Kafka Deep Dive
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson LearnedApache Kafka from 0.7 to 1.0, History and Lesson Learned
Apache Kafka from 0.7 to 1.0, History and Lesson Learned
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and Zookeeper
 
Kafka 10000 feet view
Kafka 10000 feet viewKafka 10000 feet view
Kafka 10000 feet view
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
 
Kafka overview
Kafka overviewKafka overview
Kafka overview
 
Proof of Concept on Kafka.pptx
Proof of Concept on Kafka.pptxProof of Concept on Kafka.pptx
Proof of Concept on Kafka.pptx
 
04-Kafka.pptx
04-Kafka.pptx04-Kafka.pptx
04-Kafka.pptx
 

Recently uploaded

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 

Recently uploaded (20)

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 

Apache Kafka Women Who Code Meetup

  • 1. APACHE KAFKA & USECASES WITHIN SEARCH SYSTEM @WALMARTLABS - Snehal Nagmote @WalmartLabs -WOMEN WHO CODE SUNNYVALE MEETUP
  • 2. Todays Agenda  Apache Kafka Intro  Kafka Core Concepts  Kafka Producer - In Detail  Kafka Consumer - In Detail  Kafka Ecosystem  Kafka Streams and Kafka Connect  Search Use Cases @WalmartLabs
  • 3. Kafka decouples Data Pipelines What is Apache Kafka? 3 Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka Producers Brokers Consumers
  • 4. Key terminology  Kafka maintains feeds of messages in categories called topics.  Processes that publish messages to a Kafka topic are called producers.  Processes that subscribe to topics and process the feed of published messages are called consumers.  Kafka is run as a cluster comprised of one or more servers each of which is called a broker.  Communication between all components is done via a high performance simple binary API over TCP protocol
  • 7. Why Kafka is so Fast ?  Kafka achieves it’s high throughput and low latency primarily from two key concepts  1) Batching of individual messages to amortize network overhead and append/consume chunks together  2) Zero copy I/O using sendfile (Java’s NIO FileChannel transferTo method).  Implements linux sendfile() system call which skips unnecessary copies  Heavily relies on Linux PageCache  The I/O scheduler will batch together consecutive small writes into bigger physical writes which improves throughput.  The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head which improves throughput.  It automatically uses all the free memory on the machine
  • 9. Part 2: Kafka core concepts 9
  • 10. Overview of Part 2: Kafka core concepts  A first look  Topics, partitions, replicas, offsets  Producers, brokers, consumers  Putting it all together 10
  • 11. A first look  The who is who  Producers write data to brokers.  Consumers read data from brokers.  All this is distributed.  The data  Data is stored in topics.  Topics are split into partitions, which are replicated. 11
  • 12. Topics 12 Broker(s) new Producer A1 Producer A2 Producer An … Producers always append to “tail” … Older msgs Newer msgs Consumer group C1 Consumers use an “offset pointer” to track/control their read progress (and decide the pace of consumption) Consumer group C2 Topic: feed name to which messages are published Retention Policy : time Based/size based Configs : retention.ms (Time based) or size based (retention.bytes)
  • 13. Creation of Topic  Creating a topic  CLI  Auto-create via auto.create.topics.enable = true  Modifying a topic  https://kafka.apache.org/documentation.html#basic_ops_modify_topic  Delete a topic bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication- factor 2 –partitions 3 –topic test-topic
  • 14. Partitions 14  A topic consists of partitions.  Partition: ordered + immutable sequence of messages
  • 15. Partitions 15  Partitions of a topic is configurable  Partitions determines max consumer (group) parallelism  Consumer group A, with 2 consumers, reads from a 4-partition topic  Consumer group B, with 4 consumers, reads from the same topic
  • 16. Partition offsets 16 Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset  Consumers track their pointers via (offset, partition, topic) tuples Consumer group C1
  • 17. Replicas of a partition 17 Replicas: “backups” of a partition  They exist solely to prevent data loss.  Replicas are never read from, never written to.  They do NOT help to increase producer or consumer parallelism!  Kafka tolerates (numReplicas - 1) dead brokers before losing data  numReplicas == 2  1 broker can die
  • 18. Topics vs. Partitions vs. Replicas 19 http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
  • 19. Inspecting the current state of a topic  --describe the topic  Leader: brokerID of the currently elected leader broker  Replica ID’s = broker ID’s  ISR = “in-sync replica”, replicas that are in sync with the leader  In this example:  Broker 0 is leader for partition 1.  Broker 1 is leader for partitions 0 and 2.  All replicas are in-sync with their respective leader partitions. 19 $ kafka-topics.sh --zookeeper zookeeper1:2181 --describe –topic test-topic Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs: Topic: test-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0 Topic: test-topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1 Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
  • 20. Let’s recap  The who is who  Producers write data to brokers.  Consumers read data from brokers.  All this is distributed.  The data  Data is stored in topics.  Topics are split into partitions which are replicated. 20
  • 22. Producers -Writing data to Kafka  Producers publish to a topic of their choosing (push)  Load can be distributed  Typically by “round-robin”  Can also do “semantic partitioning” based on a key in the message  Brokers load balance by partition  Support async (less durable) sending  All nodes can answer metadata requests about:  Which servers are alive  Where leaders are for the partitions of a topic
  • 23. Producer - 0.9  All calls are non-blocking async  2 Options for checking for failures:  Immediately block for response: send().get()  Do followup work in Callback, close producer after error threshold  Be careful about buffering these failures. Future work? KAFKA- 1955  Don’t forget to close the producer! producer.close() will block until in-flight txns complete  retries (producer config) defaults to 0  message.send.max.retries (server config) defaults to 3  In flight requests could lead to message re-ordering
  • 24. Understand the Kafka producer Record Accumulator batch 0 batch 1 topic0, 0 ● Serialization ● Partitioning ● Compression Tasks done by the user threads. compressor used M free callback s CB Topic Metadata topic=“topic0” value=“hello” PartitionerSerializer topic=“topic0” partition =0 value=“hello” 2 4 User: producer.send(new ProducerRecord(“topic0”, “hello”), callback);
  • 25. Sender: 1. polls batches from the batch queues (one batch / partition) 2. groups batches based on the leader broker 3. sends the grouped batches to the brokers 4. Pipelining if max.in.flight.requests.per.connection > 1 Understand the Kafka producer batch1 topic0, 0 topic0, 1 batch 1 batch 2 Broker 0 request 0 batch0 batch0 (one batch / partition) compressor used M free …... request 1 batch0 Broker 1 topic0, M sender thread Record Accumulator 2 5 callback s CB
  • 26. Producer Apis – (0.9) /** * Send the given record asynchronously and return a future which will eventually contain the response information. * @return A future which will eventually contain the response information */ public Future send(ProducerRecord record); /** * Send a record and invoke the given callback when the record has been acknowledged by the server */ public Future send(ProducerRecord record, Callback callback); }
  • 27. Producer – Message Reordering Message reordering might happen if: ● max.in.flight.requests.per.connection > 1, AND ● retries are enabled To Prevent Reordering ● max.in.flight.requests.per.connection=1 ● close producer in callback with close(0) on send failure Kafka Broker Producer message 0 Message 1 Retry message0 Message 0
  • 28. Producer – Configs  batch.size – size based batching  linger.ms – time based batching  More batching-> better compression->higher throughput->higher latency  compression.type  max.in.flight.requests.per.connection (affects ordering)  acks (affects durability)
  • 29. Producers – Data Loss  Caveats - Producer can cause data loss when  block.on.buffer.full=false  retries are exhausted  sending message without using acks=all  Solutions  block.on.buffer.full=TRUE – make sure producer does not throw messages  retries=Long.MAX_VALUE  acks=all  resend in callback when message send failed 29
  • 30. Producers – Ack Settings 3 0  Durability can be configured with the producer configuration request.required.acks  0 The message is written to the network (buffer)  1 The message is written to the leader  all The producer gets an ack after all ISRs receive the data; the message is committed . acks Throughput Latency Durability 0 high low No guarantee 1 medium medium leader -1 low high ISR
  • 31. Write operations behind the scenes  When writing to a topic in Kafka, producers write directly to the partition leaders (brokers) of that topic  Remember: Writes always go to the leader ISR of a partition!  This raises two questions:  How to know the “right” partition for a given topic?  How to know the current leader broker/replica of a partition? 31
  • 32.  In Kafka, a producer – i.e. the client – decides to which target partition a message will be sent.  Can be random ~ load balancing across receiving brokers.  Can be semantic based on message “key”, e.g. by user ID or domain name.  Here, Kafka guarantees that all data for the same key will go to the same partition, so consumers can make locality assumptions. 1) How to know the “right” partition when sending?32
  • 33. 2) How to know the current leader of a partition?  Producers: broker discovery aka bootstrapping  Producers don’t talk to ZooKeeper, so it’s not through ZK.  Broker discovery is achieved by providing producers with a “bootstrapping” broker list, cf. bootstrap.servers  These brokers inform the producer about all alive brokers and where to find current partition leaders. The bootstrap brokers do use ZK for that.  Impacts on failure handling  In Kafka 0.8 the bootstrap list is static/immutable during producer run-time.  The current bootstrap approach has improved in Kafka 0.9. This change will make the life of Ops easier. 33
  • 34. Consumer – Reading data  Multiple Consumers can read from the same topic  Each Consumer is responsible for managing it’s own offset  Messages stay on Kafka…they are not removed after they are consumed 1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 Consumer Producer Consumer Consumer 1234577 Send Writ e Fetc h Fetc h Fetc h
  • 35. Consumer  Consumers can go away 1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 Consumer Producer Consumer 1234577 Send Writ e Fetc h Fetc h
  • 36. Consumer  And then come back 1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 Consumer Producer Consumer Consumer 1234577 Send Writ e Fetc h Fetc h Fetc h
  • 37. Consumer - Groups  Consumers can be organized into Consumer Groups Common Patterns:  1) All consumer instances in one group  Acts like a traditional queue with load balancing  2) All consumer instances in different groups  All messages are broadcast to all consumer instances  3) “Logical Subscriber” – Many consumer instances in a group  Consumers are added for scalability and fault tolerance  Each consumer instance reads from one or more partitions for a topic  There cannot be more consumer instances than partitions
  • 38. Consumer - Groups P0 P3 P1 P2 C1 C2 C3 C4 C5 C6 Kafka ClusterBroker 1 Broker 2 Consumer Group A Consumer Group B Consumer Groups provide isolation to topics and partitions
  • 39. Consumer – Group Rebalancing Can rebalance themselves P0 P3 P1 P2 C1 C2 C3 C4 C5 C6 Kafka Cluster Broker 1 Broker 2 Consumer Group A Consumer Group B X
  • 41. New consumer Java API in 0.9.0 41 Configure Subscribe Process
  • 42. Delivery Semantics  At least once  Messages are never lost but may be redelivered  At most once  Messages are lost but never redelivered  Exactly once  Messages are delivered once and only once Default
  • 43. Delivery Semantics  At least once  Messages are never lost but may be redelivered  At most once  Messages are lost but never redelivered  Exactly once  Messages are delivered once and only once Much Harder (Impossible??)
  • 44. Kafka + X for processing the data?  Kafka + Spark Streaming  Near real time indexing in Search (WalmartLabs)  Kafka Streams – Library for Stream Processing  Kafka + Storm/Heron often used in combination, e.g. Twitter  Samza (since Aug ’13) – also by LinkedIn  “Normal” Java multi-threaded setups  Akka actors with Scala or Java, e.g. Ooyala  Kafka + Camus/goblin for Kafka->Hadoop ingestion 44 https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
  • 45. Kafka Ecosystem  Kafka Connect - Used for writing sources and sinks that either continuously ingest data into Kafka or continuously ingest data from Kafka into external systems.  Kafka Streams - Built-in stream processing library of the Apache Kafka project  Confluent Control Center -  Kafka Manager - A tool for managing Apache Kafka.  Kafka Mirrormaker – A tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster.  HiveKa – Hive queries on Kafka Topic http://hiveka.weebly.com/ Src : https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
  • 46. Search Usecases @WalmartLabs  Filter Pick up Today – Search filter allows you to buy online and pick up from store  Easily scales upto 15-20k events/sec  Near real time indexing pipeline  Uses 9 Kafka topics ~100’s gbs of data  NRT Walmart Sponsored Ads  Application Log Processing  Hubble logs data – Collecting User Activity Data  During holiday – almost scales upto 40-50k events/sec
  • 47. Pick Up Today- Challenges  Processing • 200 Million events/day • 5k events/sec on avg • 12k events/sec at peak  Storage • 1 item ->5000 stores (5 million*5000) • Fast retrieval
  • 48. Pick Up Today – High Level Overview
  • 49. Implementation Considerations  Reprocessing  Delivery Semantics  Recovery from Streaming Failures  Monitoring  Consumer Lag Monitoring  Apache Burrow  Consumer Offset Checker