Apache Kafka Women Who Code Meetup

APACHE KAFKA & USECASES WITHIN
SEARCH SYSTEM @WALMARTLABS
- Snehal Nagmote @WalmartLabs
-WOMEN WHO CODE SUNNYVALE MEETUP

Todays Agenda
 Apache Kafka Intro
 Kafka Core Concepts
 Kafka Producer - In Detail
 Kafka Consumer - In Detail
 Kafka Ecosystem
 Kafka Streams and Kafka Connect
 Search Use Cases @WalmartLabs

Kafka decouples Data Pipelines
What is Apache Kafka?
3
Source
System
Source
System
Source
System
Source
System
Hadoop
Security
Systems
Real-time
monitoring
Data
Warehouse
Kafka
Producers
Brokers
Consumers

Key terminology
 Kafka maintains feeds of messages in categories called
topics.
 Processes that publish messages to a Kafka topic are
called producers.
 Processes that subscribe to topics and process the feed of
published messages are called consumers.
 Kafka is run as a cluster comprised of one or more servers
each of which is called a broker.
 Communication between all components is done via a high
performance simple binary API over TCP protocol

Architecture
5
Producer
Consumer Consumer
Producers
Kafka Cluster
Consumers
Broker Broker Broker Broker
Producer
Zookeeper

KafkaVs
src: https://softwaremill.com/mqperf/

Why Kafka is so Fast ?
 Kafka achieves it’s high throughput and low latency primarily from two key
concepts
 1) Batching of individual messages to amortize network overhead and
append/consume chunks together
 2) Zero copy I/O using sendfile (Java’s NIO FileChannel transferTo method).
 Implements linux sendfile() system call which skips unnecessary copies
 Heavily relies on Linux PageCache
 The I/O scheduler will batch together consecutive small writes into
bigger physical writes which improves throughput.
 The I/O scheduler will attempt to re-sequence writes to minimize
movement of the disk head which improves throughput.
 It automatically uses all the free memory on the machine

Overview of Part 2: Kafka core
concepts
 A first look
 Topics, partitions, replicas, offsets
 Producers, brokers, consumers
 Putting it all together
10

A first look
 The who is who
 Producers write data to brokers.
 Consumers read data from brokers.
 All this is distributed.
 The data
 Data is stored in topics.
 Topics are split into partitions, which are replicated.
11

Topics
12
Broker(s)
new
Producer A1
Producer A2
Producer An
…
Producers always append to “tail”
…
Older msgs Newer msgs
Consumer group C1
Consumers use an “offset pointer” to
track/control their read progress
(and decide the pace of consumption)
Consumer group C2
Topic: feed name to which messages are published
Retention Policy : time Based/size based
Configs : retention.ms (Time based) or size based (retention.bytes)

Creation of Topic
 Creating a topic
 CLI
 Auto-create via auto.create.topics.enable = true
 Modifying a topic
 https://kafka.apache.org/documentation.html#basic_ops_modify_topic
 Delete a topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-
factor 2 –partitions 3 –topic test-topic

Partitions
14
 A topic consists of partitions.
 Partition: ordered + immutable sequence of
messages

Partitions
15
 Partitions of a topic is configurable
 Partitions determines max consumer (group)
parallelism
 Consumer group A, with 2 consumers, reads from a 4-partition topic
 Consumer group B, with 4 consumers, reads from the same topic

Partition offsets
16
Offset: messages in the partitions are each assigned a
unique (per partition) and sequential id called the
offset
 Consumers track their pointers via (offset, partition,
topic) tuples
Consumer group C1

Replicas of a partition
17
Replicas: “backups” of a partition
 They exist solely to prevent data loss.
 Replicas are never read from, never written to.
 They do NOT help to increase producer or consumer
parallelism!
 Kafka tolerates (numReplicas - 1) dead brokers before
losing data
 numReplicas == 2  1 broker can die

Topics vs. Partitions vs. Replicas
19
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/

Inspecting the current state of a topic
 --describe the topic
 Leader: brokerID of the currently elected leader broker
 Replica ID’s = broker ID’s
 ISR = “in-sync replica”, replicas that are in sync with the leader
 In this example:
 Broker 0 is leader for partition 1.
 Broker 1 is leader for partitions 0 and 2.
 All replicas are in-sync with their respective leader partitions.
19
$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe –topic test-topic
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0

Let’s recap
 The who is who
 Producers write data to brokers.
 Consumers read data from brokers.
 All this is distributed.
 The data
 Data is stored in topics.
 Topics are split into partitions which are replicated.
20

Producers -Writing data to Kafka
 Producers publish to a topic of their choosing (push)
 Load can be distributed
 Typically by “round-robin”
 Can also do “semantic partitioning” based on a key in the
message
 Brokers load balance by partition
 Support async (less durable) sending
 All nodes can answer metadata requests about:
 Which servers are alive
 Where leaders are for the partitions of a topic

Producer - 0.9
 All calls are non-blocking async
 2 Options for checking for failures:
 Immediately block for response: send().get()
 Do followup work in Callback, close producer after error threshold
 Be careful about buffering these failures. Future work? KAFKA-
1955
 Don’t forget to close the producer! producer.close() will block
until in-flight txns complete
 retries (producer config) defaults to 0
 message.send.max.retries (server config) defaults to 3
 In flight requests could lead to message re-ordering

Understand the Kafka producer
Record Accumulator
batch
0
batch
1
topic0, 0
● Serialization
● Partitioning
● Compression
Tasks done by the user threads.
compressor
used M free
callback
s
CB
Topic
Metadata
topic=“topic0”
value=“hello” PartitionerSerializer
topic=“topic0”
partition =0
value=“hello”
2
4
User: producer.send(new ProducerRecord(“topic0”, “hello”), callback);

Sender:
1. polls batches from the batch queues (one batch / partition)
2. groups batches based on the leader broker
3. sends the grouped batches to the brokers
4. Pipelining if max.in.flight.requests.per.connection > 1
Understand the Kafka producer
batch1 topic0, 0
topic0, 1
batch
1
batch
2
Broker 0
request 0
batch0 batch0
(one batch / partition)
compressor
used M free
…...
request 1
batch0
Broker 1 topic0, M
sender thread Record Accumulator
2
5
callback
s
CB

Producer Apis – (0.9)
/**
* Send the given record asynchronously and return a future which will eventually
contain the response information.
* @return A future which will eventually contain the response information
*/
public Future send(ProducerRecord record);
/**
* Send a record and invoke the given callback when the record has been
acknowledged by the server
*/
public Future send(ProducerRecord record, Callback callback);
}

Producer – Message Reordering
Message reordering might happen if:
● max.in.flight.requests.per.connection > 1, AND
● retries are enabled
To Prevent Reordering
● max.in.flight.requests.per.connection=1
● close producer in callback with close(0) on send failure
Kafka
Broker
Producer
message 0
Message 1
Retry message0
Message 0

Producer – Configs
 batch.size – size based batching
 linger.ms – time based batching
 More batching-> better compression->higher
throughput->higher latency
 compression.type
 max.in.flight.requests.per.connection (affects
ordering)
 acks (affects durability)

Producers – Data Loss
 Caveats - Producer can cause data loss when
 block.on.buffer.full=false
 retries are exhausted
 sending message without using acks=all
 Solutions
 block.on.buffer.full=TRUE – make sure producer does not
throw messages
 retries=Long.MAX_VALUE
 acks=all
 resend in callback when message send failed
29

Producers – Ack Settings
3
0
 Durability can be configured with the producer configuration
request.required.acks
 0 The message is written to the network (buffer)
 1 The message is written to the leader
 all The producer gets an ack after all ISRs receive the data; the
message is committed
.
acks Throughput Latency Durability
0 high low No guarantee
1 medium medium leader
-1 low high ISR

Write operations behind the scenes
 When writing to a topic in Kafka, producers write directly to the partition
leaders (brokers) of that topic
 Remember: Writes always go to the leader ISR of a partition!
 This raises two questions:
 How to know the “right” partition for a given topic?
 How to know the current leader broker/replica of a partition?
31

 In Kafka, a producer – i.e. the client – decides to which target partition a
message will be sent.
 Can be random ~ load balancing across receiving brokers.
 Can be semantic based on message “key”, e.g. by user ID or domain name.
 Here, Kafka guarantees that all data for the same key will go to the same partition, so
consumers can make locality assumptions.
1) How to know the “right” partition
when sending?32

2) How to know the current leader of a
partition?
 Producers: broker discovery aka bootstrapping
 Producers don’t talk to ZooKeeper, so it’s not through ZK.
 Broker discovery is achieved by providing producers with a “bootstrapping”
broker list, cf. bootstrap.servers
 These brokers inform the producer about all alive brokers and where to find current partition leaders. The
bootstrap brokers do use ZK for that.
 Impacts on failure handling
 In Kafka 0.8 the bootstrap list is static/immutable during producer run-time.
 The current bootstrap approach has improved in Kafka 0.9. This change will make
the life of Ops easier.
33

Consumer – Reading data
 Multiple Consumers can read from the same topic
 Each Consumer is responsible for managing it’s own offset
 Messages stay on Kafka…they are not removed after
they are consumed
1234567
1234568
1234569
1234570
1234571
1234572
1234573
1234574
1234575
1234576
1234577
Consumer
Producer
Consumer
Consumer
1234577
Send
Writ
e
Fetc
h
Fetc
h
Fetc
h

Consumer
 Consumers can go away
1234567
1234568
1234569
1234570
1234571
1234572
1234573
1234574
1234575
1234576
1234577
Consumer
Producer
Consumer
1234577
Send
Writ
e
Fetc
h
Fetc
h

Consumer
 And then come back
1234567
1234568
1234569
1234570
1234571
1234572
1234573
1234574
1234575
1234576
1234577
Consumer
Producer
Consumer
Consumer
1234577
Send
Writ
e
Fetc
h
Fetc
h
Fetc
h

Consumer - Groups
 Consumers can be organized into Consumer Groups
Common Patterns:
 1) All consumer instances in one group
 Acts like a traditional queue with load balancing
 2) All consumer instances in different groups
 All messages are broadcast to all consumer instances
 3) “Logical Subscriber” – Many consumer instances in a
group
 Consumers are added for scalability and fault tolerance
 Each consumer instance reads from one or more partitions for a
topic
 There cannot be more consumer instances than partitions

Consumer - Groups
P0 P3 P1 P2
C1 C2 C3 C4 C5 C6
Kafka
ClusterBroker 1 Broker 2
Consumer
Group A
Consumer Group B
Consumer
Groups provide
isolation to
topics and
partitions

Consumer – Group Rebalancing
Can rebalance
themselves
P0 P3 P1 P2
C1 C2 C3 C4 C5 C6
Kafka Cluster
Broker 1 Broker 2
Consumer Group A Consumer Group B
X

New consumer Java API in 0.9.0
41
Configure
Subscribe
Process

Delivery Semantics
 At least once
 Messages are never lost but may be redelivered
 At most once
 Messages are lost but never redelivered
 Exactly once
 Messages are delivered once and only once
Default

Delivery Semantics
 At least once
 Messages are never lost but may be redelivered
 At most once
 Messages are lost but never redelivered
 Exactly once
 Messages are delivered once and only once
Much Harder
(Impossible??)

Kafka + X for processing the data?
 Kafka + Spark Streaming
 Near real time indexing in Search (WalmartLabs)
 Kafka Streams – Library for Stream Processing
 Kafka + Storm/Heron often used in combination, e.g. Twitter
 Samza (since Aug ’13) – also by LinkedIn
 “Normal” Java multi-threaded setups
 Akka actors with Scala or Java, e.g. Ooyala
 Kafka + Camus/goblin for Kafka->Hadoop ingestion
44
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By

Kafka Ecosystem
 Kafka Connect - Used for writing sources and sinks that either
continuously ingest data into Kafka or continuously ingest data from
Kafka into external systems.
 Kafka Streams - Built-in stream processing library of the Apache
Kafka project
 Confluent Control Center -
 Kafka Manager - A tool for managing Apache Kafka.
 Kafka Mirrormaker – A tool to mirror a source Kafka cluster into a
target (mirror) Kafka cluster.
 HiveKa – Hive queries on Kafka Topic http://hiveka.weebly.com/
Src : https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

Search Usecases @WalmartLabs
 Filter Pick up Today – Search filter allows you to
buy online and pick up from store
 Easily scales upto 15-20k events/sec
 Near real time indexing pipeline
 Uses 9 Kafka topics ~100’s gbs of data
 NRT Walmart Sponsored Ads
 Application Log Processing
 Hubble logs data – Collecting User Activity Data
 During holiday – almost scales upto 40-50k events/sec

Pick Up Today- Challenges
 Processing
• 200 Million events/day
• 5k events/sec on avg
• 12k events/sec at peak
 Storage
• 1 item ->5000 stores (5 million*5000)
• Fast retrieval

Pick Up Today – High Level Overview

Implementation Considerations
 Reprocessing
 Delivery Semantics
 Recovery from Streaming Failures
 Monitoring
 Consumer Lag Monitoring
 Apache Burrow
 Consumer Offset Checker

Streaming Component Interaction

Apache Kafka Women Who Code Meetup

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Apache Kafka Women Who Code Meetup

Similar to Apache Kafka Women Who Code Meetup (20)

Recently uploaded

Recently uploaded (20)

Apache Kafka Women Who Code Meetup