Citi TechTalk Session 2: Kafka Deep Dive

Apache Kafka®
Internal Architecture:
The Fundamentals

developer.confluent.io
What’s Covered
1 Inside the Broker
Control Plane, Replication Protocols, Topic Compaction, etc.
2 Producer - Durability, Availability and Ordering Guarantees
3 Consumer Group Protocol
4 Transactions

Overview of Kafka Architecture
Compute Layer
Storage Layer

Events
Event Processing
Application
Event Source Event Stream
Record =>
timestamp
key
value
Headers

Record Schema
Record =>
timestamp
key
value
Headers
Event Stream
key/value
Bytes  Area        Description
0      Magic Byte  Confluent serialization format version number; currently always 0.
1-4    Schema ID   4-byte schema ID as returned by Schema Registry.
5-...  Data        Serialized data for the specified schema format.
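As a rough illustration of the byte layout above, the sketch below splits a framed message into its magic byte, schema ID, and data using only `java.nio`. It assumes the 4-byte schema ID is stored big-endian (network byte order). The class and method names (`WireFormatSketch`, `frame`, `schemaId`, `payload`) are hypothetical and are not part of any Confluent library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: frames and parses messages using the layout
// byte 0 = magic byte, bytes 1-4 = schema ID, bytes 5.. = serialized data.
final class WireFormatSketch {
    static final byte MAGIC_BYTE = 0;
    static final int HEADER_SIZE = 5;

    // Prepend the magic byte and schema ID to already-serialized data.
    static byte[] frame(int schemaId, byte[] data) {
        return ByteBuffer.allocate(HEADER_SIZE + data.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId) // ByteBuffer is big-endian by default (assumption about the format)
                .put(data)
                .array();
    }

    // Read the schema ID, rejecting messages with an unexpected magic byte.
    static int schemaId(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte magic = buffer.get();
        if (magic != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unexpected magic byte: " + magic);
        }
        return buffer.getInt();
    }

    // Everything after the 5-byte header is the serialized data.
    static byte[] payload(byte[] message) {
        return Arrays.copyOfRange(message, HEADER_SIZE, message.length);
    }

    public static void main(String[] args) {
        byte[] framed = frame(42, "deposit".getBytes(StandardCharsets.UTF_8));
        System.out.println("schema ID: " + schemaId(framed));
        System.out.println("data: " + new String(payload(framed), StandardCharsets.UTF_8));
    }
}
```

A real deserializer would use the schema ID to look up (and cache) the schema in Schema Registry before decoding the data bytes.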

Kafka Topics
Event Stream
Event Source
Event Consumer
Kafka Cluster
account-deposits
Events are immutable Append only

Topic Partitions
Kafka Cluster
Broker
partition 1
account-deposits
Broker
partition 4
account-deposits
partition 2
account-deposits
Broker
partition 3
account-deposits
Broker
partition 0
account-deposits
Confluent Tiered Storage
(AWS S3, Google Cloud Storage)

Partition Offsets
Kafka Cluster
Broker
partition 1
0 1 2 3
account-deposits
Broker
partition 4
0 1 2 3
account-deposits
partition 2
0 1 2
account-deposits
Broker
partition 3
0 1 2 3 4
account-deposits
Broker
partition 0
0 1
account-deposits
Confluent Tiered Storage
(AWS S3, Google Cloud Storage)
The first event
is offset 0
Offsets increase
monotonically

Inside the Apache Kafka® Broker

Kafka Manages Data & Metadata Separately
Kafka Cluster
Controller
Broker
Controller
Broker
Broker
KRaft Consensus protocol
for the Control Plane
● Old: ZooKeeper
● New: KRaft Controller
Client request processing
+ data replication
Control Plane
Data Plane
Broker
Controller
(active)
Broker Broker
THIS MODULE
Inside the Apache Kafka® Broker

Inside the Apache Kafka Broker
Broker Broker
Broker Broker
Broker Broker

Assign Record to Topic Partition
Record =>
timestamp
key
value
Headers

Records Accumulated Into Record Batches
Record =>
timestamp
key
value
Headers
RecordBatch =>
…
…
attributes: int16
bit 0~2:
0: no compression
1: gzip
2: snappy
3: lz4
4: zstd
…
…
records: [Records]
Compression
Record 1
Record 2
Record n
…
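The compression codec lives in bits 0~2 of the batch `attributes` field, so it can be recovered with a bit mask. A minimal sketch, assuming only the codec IDs listed on the slide; `BatchAttributesSketch` and `compression` are hypothetical names, and the meaning of the higher bits is deliberately ignored here.

```java
// Hypothetical decoder for the compression bits of RecordBatch attributes.
final class BatchAttributesSketch {
    // Index in this array = codec ID from the slide (0: none ... 4: zstd).
    static final String[] CODECS = {"none", "gzip", "snappy", "lz4", "zstd"};
    static final int COMPRESSION_MASK = 0x07; // bits 0~2

    static String compression(short attributes) {
        int codecId = attributes & COMPRESSION_MASK;
        if (codecId >= CODECS.length) {
            throw new IllegalArgumentException("Unknown compression codec ID: " + codecId);
        }
        return CODECS[codecId];
    }

    public static void main(String[] args) {
        System.out.println(compression((short) 0)); // lower three bits are 000
        System.out.println(compression((short) 3)); // lower three bits are 011
    }
}
```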

Record Batches Drained Into Produce Requests
Record Batch 1
Record Batch 2
...
Record Batch n
Produce Request => acks [topic_data]
  acks => INT16
  topic_data => topic [data]
    topic => STRING
    data => partition record_set
      partition => INT32
      record_set => BYTES
Record =>
timestamp
key
value
Headers
RecordBatch =>
…
…
attributes: int16
bit 0~2:
0: no compression
1: gzip
2: snappy
3: lz4
4: zstd
…
…
records: [Records]
Compression
Record 1
Record 2
Record n
…
linger.ms
batch.size
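One way to read the interplay of `linger.ms` and `batch.size`: a batch becomes eligible to be drained once it is full or once it has waited for the linger time. The sketch below encodes just that rule. It is a simplification under that assumption (the real producer also drains on flush, close, and memory pressure), and `BatchReadySketch` is a hypothetical name.

```java
// Hypothetical readiness check for a single record batch.
final class BatchReadySketch {
    static boolean readyToSend(int batchBytes, long waitedMs, int batchSizeBytes, long lingerMs) {
        boolean full = batchBytes >= batchSizeBytes; // batch.size reached
        boolean lingered = waitedMs >= lingerMs;     // linger.ms elapsed
        return full || lingered;
    }

    public static void main(String[] args) {
        int batchSize = 16 * 1024; // 16KB, as on the slides
        // With linger.ms=0 even a tiny batch is sent immediately.
        System.out.println(readyToSend(100, 0, batchSize, 0));
        // With linger.ms=5 the same batch waits for more records.
        System.out.println(readyToSend(100, 2, batchSize, 5));
    }
}
```

Raising `linger.ms` trades a little latency for larger batches, which usually improves compression and throughput.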

Network Thread Adds Request to Queue

IO Thread Verifies Record Batch And Stores

Kafka Physical Storage
/var/lib/kafka/data/account-deposits-1
00000000000047926734.log
00000000000047926734.index
...
00000000000052497535.log
00000000000052497535.index
...
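Each segment file in the partition directory is named after the offset of its first record, zero-padded to 20 digits, so locating the segment for an offset is a floor lookup over the base offsets. A minimal sketch of that lookup; `SegmentLookupSketch` and its methods are hypothetical names, and the in-memory `TreeMap` merely stands in for the broker's own bookkeeping.

```java
import java.util.TreeMap;

// Hypothetical lookup: which segment file holds a given offset?
final class SegmentLookupSketch {
    // Segment log file name = base offset padded to 20 digits + ".log".
    static String logFileName(long baseOffset) {
        return String.format("%020d.log", baseOffset);
    }

    // The owning segment is the one with the largest base offset <= the target offset.
    static long segmentFor(TreeMap<Long, String> segments, long offset) {
        Long baseOffset = segments.floorKey(offset);
        if (baseOffset == null) {
            throw new IllegalArgumentException("Offset " + offset + " is below the first segment");
        }
        return baseOffset;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(47926734L, logFileName(47926734L));
        segments.put(52497535L, logFileName(52497535L));
        long base = segmentFor(segments, 50_000_000L);
        System.out.println("offset 50000000 is in " + segments.get(base));
    }
}
```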

By Default, Data Is Written Asynchronously to Disk

Purgatory Holds Requests Being Replicated

Response Added to Socket Send Buffer

Fetch Requests
● Contiguous fetch ranges are calculated
● Fetch from object storage (if not available locally)
● Data is (zero) copied to the send buffer

Producer - Configuring
Durability, Availability and
Ordering Guarantees

Producer
acks=1
enable.idempotence=false
max.request.size=1MB
retries=MAX_INT
delivery.timeout.ms=2min
max.in.flight.requests.per.connection=5
Serializer
● Retrieves and caches schemas from Schema Registry
Partitioner
● Java client uses murmur2 for hashing
● If no key is provided, performs round robin
● If keys are unbalanced, it will overload one leader
● Upcoming changes in KIP-794
Sender thread
● Batches are grouped by destination broker into requests
● Multiple batches to different partitions can potentially be in the same producer request
Record accumulator
● Buffer per partition; seldom-used partitions may not achieve high batching
● If many producers are in the same JVM, memory and GC could become important
● Sticky partitioner could be used to increase batch sizes in the case of round robin (KIP-480/KIP-794)
Compression
● At batch level
● Allows faster transfer to the broker
● Reduces the inter-broker replication load
● Reduces page cache & disk space utilization on brokers
● Gzip is more CPU intensive, Snappy is lighter, LZ4/ZStd are a good balance*
record batch request
batch.size=16KB
linger.ms=0
buffer.memory=32MB
max.block.ms=60s
compression.type=none
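The keyed-partitioning idea (hash the key bytes, then take the result modulo the partition count) can be sketched without the Kafka client. Note the loud assumption: `Arrays.hashCode` is used here only as a stand-in for murmur2, so the partition numbers will not match what the real Java client computes. `KeyPartitionerSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical keyed partitioner; the hash function is a stand-in, NOT murmur2.
final class KeyPartitionerSketch {
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);       // stand-in for murmur2(keyBytes)
        return (hash & 0x7fffffff) % numPartitions; // clear the sign bit, then modulo
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition, which preserves per-key ordering
        // and is also why a very hot key can overload a single partition leader.
        System.out.println(partitionFor("alice", 5));
        System.out.println(partitionFor("alice", 5));
        System.out.println(partitionFor("bob", 5));
    }
}
```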

Data Durability and Availability Guarantees
[Diagram: producer writing to partition 0 (offsets 0-9); leader on broker-101, followers on broker-102 and broker-103; ISR = [101, 102, 103]]

Producer acks=0
“fire and forget”: no response from broker
[Diagram: producer writing to partition 0 (offsets 0-9); leader on broker-101, followers on broker-102 and broker-103; ISR = [101, 102, 103]]

Producer acks=1
Broker acks once written to leader
[Diagram: producer writing to partition 0 (offsets 0-9); leader on broker-101, followers on broker-102 and broker-103; ISR = [101, 102, 103]]

Producer acks=all
Broker acks once written to all ISR members
[Diagram: producer writing to partition 0 (offsets 0-9); leader on broker-101, followers on broker-102 and broker-103; ISR = [101, 102, 103]]

Topic min.insync.replicas
min.insync.replicas=2
Not enough replicas exception
[Diagram: producer writing to partition 0 (offsets 0-9); leader on broker-101, followers on broker-102 and broker-103; ISR = [101, 102, 103]]
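The three `acks` settings and the `min.insync.replicas` check can be summarized as one decision: how many in-sync replicas must hold the record before the leader responds, and when the write is refused outright. A simplified model of that decision; `AckDecisionSketch` is a hypothetical name, and a plain `IllegalStateException` stands in for the broker's not-enough-replicas error.

```java
// Hypothetical model of the leader's acknowledgement rule.
final class AckDecisionSketch {
    // Returns how many ISR members must have the record before the producer gets a response.
    static int requiredReplicas(String acks, int isrSize, int minInsyncReplicas) {
        switch (acks) {
            case "0":
                return 0; // fire and forget: no response from the broker
            case "1":
                return 1; // acknowledged once written to the leader
            case "all":
                if (isrSize < minInsyncReplicas) {
                    // Stand-in for the "not enough replicas" error returned to the producer.
                    throw new IllegalStateException(
                            "Not enough replicas: ISR size " + isrSize + " < " + minInsyncReplicas);
                }
                return isrSize; // acknowledged once written to all ISR members
            default:
                throw new IllegalArgumentException("Unsupported acks value: " + acks);
        }
    }

    public static void main(String[] args) {
        System.out.println(requiredReplicas("all", 3, 2)); // healthy ISR = [101, 102, 103]
        try {
            requiredReplicas("all", 1, 2); // only the leader is left in the ISR
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With `acks=all` and `min.insync.replicas=2`, the partition stays writable after losing one of three replicas but rejects writes once only the leader remains.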

Producer Idempotence
enable.idempotence=true
Without idempotence, duplicate and out of order records can occur
[Diagram, without idempotence: replica logs of partition 0 (new leader, leader, follower) shown as m1 m2 m1 m2 m3, m1 m2 m3, m1 m2 m1 m2 m3]
Idempotence seq #’s prevent both of these issues
[Diagram, with idempotence: new leader, leader, and follower replicas of partition 0 each hold m1 (pid: 1, seq: 0), m2 (pid: 1, seq: 1), m3 (pid: 1, seq: 2); producer; Broker (any)]
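The role of the producer ID and sequence number can be sketched as a small check on append: accept the next expected sequence, drop anything already seen, and reject gaps. This is a simplified model for one partition (real brokers track state per producer and partition and over a window of recent batches); `IdempotentLogSketch` and the result strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-partition log that deduplicates by (pid, seq).
final class IdempotentLogSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<Long, Integer> lastSeqByPid = new HashMap<>();

    // Returns "APPENDED", "DUPLICATE", or "OUT_OF_ORDER".
    String append(long pid, int seq, String value) {
        int expected = lastSeqByPid.getOrDefault(pid, -1) + 1;
        if (seq < expected) {
            return "DUPLICATE";    // a retry of a record that was already written
        }
        if (seq > expected) {
            return "OUT_OF_ORDER"; // a gap: an earlier record has not arrived yet
        }
        log.add(value);
        lastSeqByPid.put(pid, seq);
        return "APPENDED";
    }

    List<String> records() {
        return log;
    }

    public static void main(String[] args) {
        IdempotentLogSketch partition = new IdempotentLogSketch();
        partition.append(1L, 0, "m1");
        partition.append(1L, 1, "m2");
        // The producer retries m1 and m2 after a leader change; both are recognized as duplicates.
        System.out.println(partition.append(1L, 0, "m1"));
        System.out.println(partition.append(1L, 1, "m2"));
        partition.append(1L, 2, "m3");
        System.out.println(partition.records());
    }
}
```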

t1
P0
P1
Kafka
t2
P0
P1
Consumer
Group
Consumer 1
group.id=1
Consumer 2
group.id=1
Consumer 3
group.id=1
Consumer 4
group.id=1
Kafka Consumer Group
To enable group consumption:
1) Configure group.id
2) List<String> topics = List.of("t1", "t2");
   consumer.subscribe(topics);
Benefits:
1) Scalability
2) Elasticity
3) Fault tolerance

Group Coordinator
Broker
Group Coordinator
__consumer_offsets
Broker
Group Coordinator
__consumer_offsets
Broker
Group Coordinator
__consumer_offsets
Consumer Group
Consumer 1
group.id=1
Consumer 2
group.id=1
Consumer Group
Consumer 1
group.id=2
Consumer 2
group.id=2
Group coordinator
determined by
group.id
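A sketch of how a `group.id` can map to a coordinator: hash the group ID onto one partition of `__consumer_offsets`, and treat the broker leading that partition as the group coordinator. The exact hash shown (Java's `String.hashCode`, made non-negative, modulo the partition count) and the default of 50 partitions are stated here as assumptions about the broker implementation; `CoordinatorLookupSketch` is a hypothetical name.

```java
// Hypothetical mapping from group.id to a __consumer_offsets partition.
final class CoordinatorLookupSketch {
    static int offsetsPartitionFor(String groupId, int numOffsetsPartitions) {
        int hash = groupId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so handle it explicitly.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % numOffsetsPartitions;
    }

    public static void main(String[] args) {
        int partitions = 50; // assumed default partition count of __consumer_offsets
        System.out.println("group.id=1 -> partition " + offsetsPartitionFor("1", partitions));
        System.out.println("group.id=2 -> partition " + offsetsPartitionFor("2", partitions));
    }
}
```

Because the mapping depends only on the group ID, every member of a group resolves to the same coordinator, and different groups spread across brokers.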

Group Startup: Step 1 - Find Group Coordinator
Broker
Group Coordinator
__consumer_offsets
Broker
__consumer_offsets
Broker
__consumer_offsets
Consumer Group
Consumer 1
group.id=1
Consumer 2
group.id=1
Sent to any broker

Group Startup: Step 2 - Members Join
Broker
Group Coordinator
__consumer_offsets
Broker
topic-a
Broker
topic-a
Consumer Group
Consumer 1
group.id=1
group leader
Consumer 2
group.id=1

Group Startup: Step 3 - Partitions Assigned
Consumer Group
Consumer 1
group.id=1
group leader
Consumer 2
group.id=1
Broker
Group Coordinator
__consumer_offsets
Broker
topic-a
Broker
topic-a

customers
P0
P1
Kafka
orders
P0
P1
Range Partition Assignment Strategy
Consumer Group
Consumer 1
Consumer 2
Consumer 4
Consumer 5
Consumer 3
Idle instances
P1 from
both topics
P0 from
both topics
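The range strategy assigns each topic independently: consumers are sorted, and each topic's partitions are split into contiguous ranges starting from the first consumer. The sketch below is a simplified re-implementation of that idea, not the client's `RangeAssignor` class; `RangeAssignSketch` and the `topic-Pn` labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical range-style assignment: per topic, contiguous ranges per consumer.
final class RangeAssignSketch {
    static Map<String, List<String>> assign(List<String> consumers, Map<String, Integer> partitionsPerTopic) {
        List<String> members = new ArrayList<>(consumers);
        Collections.sort(members);
        Map<String, List<String>> assignment = new TreeMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        for (Map.Entry<String, Integer> topic : new TreeMap<>(partitionsPerTopic).entrySet()) {
            int base = topic.getValue() / members.size();  // partitions every member gets
            int extra = topic.getValue() % members.size(); // first 'extra' members get one more
            int next = 0;
            for (int i = 0; i < members.size(); i++) {
                int count = base + (i < extra ? 1 : 0);
                for (int p = next; p < next + count; p++) {
                    assignment.get(members.get(i)).add(topic.getKey() + "-P" + p);
                }
                next += count;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = Map.of("customers", 2, "orders", 2);
        List<String> group = List.of("Consumer 1", "Consumer 2", "Consumer 3", "Consumer 4", "Consumer 5");
        // Only the first two consumers receive partitions; the other three stay idle.
        System.out.println(assign(group, topics));
    }
}
```

Because every topic restarts at the first consumer, the same consumer receives P0 of both topics, which is useful for co-partitioned joins but leaves the remaining instances idle.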

Round Robin and Sticky Partition
Assignment Strategies
Consumer Group
Consumer 1
Consumer 2
Consumer 4
Consumer 5
Consumer 3
customers
P0
P1
orders
P0
P1
Even distribution
of partitions
Idle instance
Kafka
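For comparison, a round-robin style assignment deals all partitions of all subscribed topics across the sorted consumers one at a time. The sketch covers only the round-robin half of this slide (sticky assignment additionally tries to preserve previous ownership) and is a simplified re-implementation, not the client's assignor; `RoundRobinAssignSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical round-robin assignment across all topic partitions.
final class RoundRobinAssignSketch {
    static Map<String, List<String>> assign(List<String> consumers, Map<String, Integer> partitionsPerTopic) {
        List<String> members = new ArrayList<>(consumers);
        Collections.sort(members);
        Map<String, List<String>> assignment = new TreeMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        int next = 0; // index of the consumer that receives the next partition
        for (Map.Entry<String, Integer> topic : new TreeMap<>(partitionsPerTopic).entrySet()) {
            for (int p = 0; p < topic.getValue(); p++) {
                String member = members.get(next % members.size());
                assignment.get(member).add(topic.getKey() + "-P" + p);
                next++;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = Map.of("customers", 2, "orders", 2);
        List<String> group = List.of("Consumer 1", "Consumer 2", "Consumer 3", "Consumer 4", "Consumer 5");
        // Four consumers get one partition each; one instance stays idle.
        System.out.println(assign(group, topics));
    }
}
```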

Consumer Group
Consumer 1
group.id=1
poll( )
Consumer 2
group.id=1
poll( )
Tracking Partition Consumption
Broker
Group Coordinator
__consumer_offsets
Broker
topic-a
Broker
topic-a
key group.id,
topic,
partition
value offset number

Consumer Group
Consumer 1
group.id=1
poll( )
Consumer 2
group.id=1
poll( )
Determining Starting Offset to Consume
Broker
Group Coordinator
__consumer_offsets
Broker
topic-a
Broker
topic-a
If no committed
offset is available,
auto.offset.reset
value determines
starting offset
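The lookup a consumer performs on startup can be modelled as: use the committed offset for (group.id, topic, partition) if one exists, otherwise fall back to `auto.offset.reset`. In this sketch an in-memory map stands in for `__consumer_offsets`; `StartingOffsetSketch` and its method names are hypothetical, and the three reset values `earliest`, `latest`, and `none` are the consumer's documented options.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of committed offsets and the auto.offset.reset fallback.
final class StartingOffsetSketch {
    // Key: group.id/topic/partition, value: committed offset.
    private final Map<String, Long> committed = new HashMap<>();

    private static String key(String groupId, String topic, int partition) {
        return groupId + "/" + topic + "/" + partition;
    }

    void commit(String groupId, String topic, int partition, long offset) {
        committed.put(key(groupId, topic, partition), offset);
    }

    long startingOffset(String groupId, String topic, int partition,
                        String autoOffsetReset, long logStartOffset, long logEndOffset) {
        Long offset = committed.get(key(groupId, topic, partition));
        if (offset != null) {
            return offset; // resume where the group left off
        }
        switch (autoOffsetReset) {
            case "earliest":
                return logStartOffset;
            case "latest":
                return logEndOffset;
            default: // "none": the consumer surfaces an error instead of guessing
                throw new IllegalStateException("No committed offset for " + key(groupId, topic, partition));
        }
    }

    public static void main(String[] args) {
        StartingOffsetSketch offsets = new StartingOffsetSketch();
        offsets.commit("1", "topic-a", 0, 42L);
        System.out.println(offsets.startingOffset("1", "topic-a", 0, "latest", 0L, 100L));
        System.out.println(offsets.startingOffset("1", "topic-a", 1, "earliest", 0L, 100L));
    }
}
```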

Group Coordinator Failover
Broker
Group Coordinator
__consumer_offsets
Broker
Group Coordinator
__consumer_offsets
Broker
Group Coordinator
__consumer_offsets
Consumer Group
Consumer 1
group.id=1
Consumer 2
group.id=1
Group coordinator
fails over to
__consumer_offsets
new partition leader

Consumer Group
Consumer 1
group.id=1
poll( )
HeartbeatThread
Consumer Group Rebalance Triggers
topic_a
P0
P1
P2
P3
topic_b
P0
P1
Topic added or deleted that matches subscription
consumer.subscribe(Pattern.compile("topic_.*"));
# of partitions
increases, e.g.
from 3 to 4
Consumer instance
joins or leaves group,
e.g. heartbeat timeout
Consumer 2
group.id=1
poll( )
HeartbeatThread
Consumer 3
group.id=1
poll( )
HeartbeatThread

Consumer Group
Consumer 1
group.id=1
poll( )
HeartbeatThread
Consumer 2
group.id=1
poll( )
HeartbeatThread
Consumer Group Rebalance Notification
Broker
Group Coordinator
__consumer_offsets
Broker
topic-a
Broker
topic-a

Stop-the-world Rebalance
Group coordinator
Consumer 1 (p0,p2)
Consumer 2 (p1)
Consumer 3 joins
Synchronization barrier
(p0)
(p1)
(p2)
Consumers:
1) Revoke current partition assignment and clean up the partition states
2) Join the group
3) Sync with the group
4) Receive new partition assignments
   a) Build the partition state
   b) Resume consumption

Stop-the-world Problem 1 -
Rebuilding the State
Group coordinator
Consumer 1 (p0,p2)
Consumer 2 (p1)
Consumer 3 joins
(p0)
(p1)
(p2)
Since partitions p0 and p1
are assigned to the same
consumer instance,
rebuilding the state is
unnecessary

Stop-the-world Problem 2 -
Paused Processing
Processing
paused
Group coordinator
Consumer 1 (p0,p2)
Consumer 2 (p1)
Consumer 3 joins
(p0)
(p1)
(p2)
Processing pauses for all
subscribed partitions for
the duration of the
rebalance
● The pausing for p0 and
p1 is unnecessary

Avoid Needless State Rebuild with StickyAssignor
Group coordinator
Consumer 1 (p0,p2)
Consumer 2 (p1)
Consumer 3 joins
(p0, p2)
(p1)
(p2)
Processing
paused
Partition reassigned
● State cleanup and build
Assigned partitions self-revoked
● Clean up state

Avoid Processing Pause with CooperativeStickyAssignor
Consumer 1 (p0,p2)
Consumer 2 (p1)
Consumer 3 joins
(p0)
(p1)
(p2)
p0, p1
consumption
continues
1st rebalance
Synchronization
barrier
SyncGroupResponse revokes p2 assignment
World does not stop!
p2
revoked

Avoid Processing Pause with CooperativeStickyAssignor
Consumer 1 (p0,p2)
Consumer 2 (p1)
Consumer 3 joins
(p0)
(p1)
(p2)
1st rebalance 2nd rebalance
Synchronization
barrier
p2
revoked
SyncGroupResponse
assigns p2
p0, p1
consumption
continues
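The key difference in the cooperative protocol is that a member only gives up the partitions that are actually moving, and keeps consuming the rest. That boils down to two set operations on the old and target assignments. A minimal sketch of that bookkeeping; `CooperativeRevocationSketch` is a hypothetical name, and this is not the `CooperativeStickyAssignor` implementation itself.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical bookkeeping for one member during a cooperative rebalance.
final class CooperativeRevocationSketch {
    // Partitions revoked in the first rebalance: owned now, absent from the target assignment.
    static Set<String> toRevoke(Set<String> owned, Set<String> target) {
        Set<String> revoked = new TreeSet<>(owned);
        revoked.removeAll(target);
        return revoked;
    }

    // Partitions whose consumption continues uninterrupted.
    static Set<String> retained(Set<String> owned, Set<String> target) {
        Set<String> kept = new TreeSet<>(owned);
        kept.retainAll(target);
        return kept;
    }

    public static void main(String[] args) {
        // Consumer 1 owns (p0, p2); after Consumer 3 joins its target is (p0).
        Set<String> owned = Set.of("p0", "p2");
        Set<String> target = Set.of("p0");
        System.out.println("revoked in 1st rebalance: " + toRevoke(owned, target));
        System.out.println("keeps consuming: " + retained(owned, target));
        // p2 is handed to Consumer 3 only in the 2nd rebalance, once it is free.
    }
}
```

In the Java consumer this behaviour is selected through `partition.assignment.strategy`; an eager assignor would instead revoke everything before rejoining.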

Consumer Group
Consumer 1
group.id=1
group.instance.id=1
HeartbeatThread
Avoid Rebalance with Static
Group Membership
topic_a
P0
P1
P2
P3
topic_b
P0
P1
Consumer 2
group.id=1
group.instance.id=2
HeartbeatThread
Consumer 3
group.id=1
group.instance.id=3
HeartbeatThread
Establishes
static group
membership
Members do not
send LeaveGroup
request when
they are stopped
Group
Coordinator
No rebalance if
member rejoins prior
to session.timeout.ms
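Static membership is purely a configuration choice on the consumer. The sketch below assembles the relevant settings with `java.util.Properties` so it runs without the Kafka client on the classpath. The session timeout value is an arbitrary illustration, `StaticMemberConfigSketch` is a hypothetical name, and a real consumer would also need `bootstrap.servers` and deserializers.

```java
import java.util.Properties;

// Hypothetical builder for a statically identified consumer's settings.
final class StaticMemberConfigSketch {
    static Properties consumerProps(String groupId, String instanceId) {
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        // A stable, unique ID per instance establishes static group membership.
        props.setProperty("group.instance.id", instanceId);
        // Illustrative value only: a restart that completes within this window avoids a rebalance.
        props.setProperty("session.timeout.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("1", "1");
        props.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

The trade-off is slower failure detection: a member that is truly gone is only noticed once `session.timeout.ms` expires.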

Transactions in Apache Kafka®

Why Are Transactions Needed?
Alice pays Bob $10
Event is written to transfers topic: transfer $10, Alice → Bob
Funds Transfer App (Consumer API, Producer API):
1. Event is fetched by the consumer
2. Debit event is written
3. Credit event is written
4. Transfer event offset is committed
[Diagram: brokers hosting the transfers topic (A->B), the balances topic (partitions P0 and P1, holding Alice, ($10) and Bob, $10), and the Consumer Group Coordinator with the __consumer_offsets topic (P7); Downstream App reads via the Consumer API]

Atomic Transaction
Broker
Consumer Group
Coordinator
P7
Broker
balances topic
($10) A
transfers topic
A->B
Broker
balances topic
P0
P0
P1
Kafka Transactions Deliver Exactly Once
Funds Transfer App
Consumer API
Producer API
Alice
Bob
transfer $10
Alice → Bob
1
4
Alice, ($10)
Bob, $10
2
3
Downstream App
Consumer API
Transaction is only
committed if all
parts succeed
Is aborted if
any part fails
Using transactions with
Kafka Streams is quite
simple:
1) Set processing.guarantee
to exactly_once_v2 in
StreamsConfig
2) Set isolation.level to
read_committed in the
Consumer configuration
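The two settings named above can be collected as plain `java.util.Properties`, which keeps the example runnable without Kafka on the classpath. `ExactlyOnceConfigSketch` and its method names are hypothetical, and a real Kafka Streams application would additionally need at least `application.id` and `bootstrap.servers`.

```java
import java.util.Properties;

// Hypothetical holders for the exactly-once related settings.
final class ExactlyOnceConfigSketch {
    // Settings for the Kafka Streams application that performs the transactional writes.
    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("processing.guarantee", "exactly_once_v2");
        return props;
    }

    // Settings for a downstream consumer that should only see committed records.
    static Properties downstreamConsumerProps() {
        Properties props = new Properties();
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps());
        System.out.println(downstreamConsumerProps());
    }
}
```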

A
Broker
Consumer Group
Coordinator
P7
Broker
balances topic
($10)
transfers topic
A->B
Broker
balances topic
P0
P0
P1
System Failure Without Transactions
Funds Transfer App
Consumer API
Producer API
Alice
Bob
transfer $10
Alice → Bob
1. Event fetched by consumer
2. Alice’s account debited
1
2 Alice, ($10)
Downstream App
Consumer API

A
Broker
Consumer Group
Coordinator
1 P7
Broker
balances topic
($10)($10)
transfers topic
A->B
Broker
balances topic
$10
P0
P0
P1
Downstream App
Consumer API
System Failure Without Transactions
Funds Transfer App
Consumer API
Producer API
Alice
Bob
transfer $10
Alice → Bob
Funds Transfer App
Consumer API
Producer API
Application instance fails
without committing offset and
new application instance starts
2. Alice’s account is debited a
second time
3. Bob’s account is credited
4. Consumer offset committed
5. Two debit events processed
by downstream consumer
1
Alice, ($10)
2
Alice, ($10)
Bob, $10
2
5
1
3
4

A
Broker
Consumer Group
Coordinator
P7
Broker
balances topic
($10)
transfers topic
A->B
Broker
balances topic
P0
P0
P1
Broker
Transaction Coordinator
__transaction_state topic
System Failure with Transactions
Funds Transfer App
Consumer API
Producer API
Alice
Bob
transfer $10
Alice → Bob
transactional.id='fund-tr'
Coordinates txn and
persists txn metadata
2
1
Alice, ($10)
4
3
Downstream App
Consumer API
'fund-tr'=>pid e0 P0
1. Requests txn ID, is returned
PID and txn epoch
3. Notifies coordinator of
partition being written to
isolation.level=
'read_committed'

A
Broker
Consumer Group
Coordinator
P7
Broker
balances topic
($10) A
transfers topic
A->B
Broker
balances topic
P0
P0
P1
System Failure with Transactions
Funds Transfer App
Consumer API
Producer API
Broker
Alice
Bob
transfer $10
Alice → Bob
2
1
Alice, ($10)
4
3
Downstream App
Consumer API
1. Requests txn ID, is returned PID
and txn epoch
3. Notifies coordinator of partition
being written to
Application instance fails
without committing offset and
new application instance starts
1. New instance requests txn ID
a. Coordinator fences
previous instance by
aborting pending txn and
bumping up epoch
2. Downstream consumer with
read_committed discards
aborted events
Funds Transfer App
Consumer API
Producer API 1
2
isolation.level=
'read_committed'
'fund-tr'=>pid e0 P0 A pid e1
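Fencing can be pictured as an epoch counter per `transactional.id`: each new instance that registers bumps the epoch, and anything still carrying the old epoch is rejected. The sketch models only that counter (the real coordinator also aborts the pending transaction and persists its state in `__transaction_state`); `TxnFencingSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of epoch-based fencing in the transaction coordinator.
final class TxnFencingSketch {
    private final Map<String, Integer> epochByTxnId = new HashMap<>();

    // Called when an instance registers its transactional.id; returns the epoch it must use.
    int register(String transactionalId) {
        // First registration gets epoch 0; every later one bumps the epoch by 1.
        return epochByTxnId.merge(transactionalId, 0, (current, initial) -> current + 1);
    }

    // A request is accepted only if it carries the current epoch.
    boolean accepts(String transactionalId, int epoch) {
        Integer current = epochByTxnId.get(transactionalId);
        return current != null && current == epoch;
    }

    public static void main(String[] args) {
        TxnFencingSketch coordinator = new TxnFencingSketch();
        int oldEpoch = coordinator.register("fund-tr"); // original instance (e0)
        int newEpoch = coordinator.register("fund-tr"); // replacement instance (e1)
        System.out.println("old instance accepted: " + coordinator.accepts("fund-tr", oldEpoch));
        System.out.println("new instance accepted: " + coordinator.accepts("fund-tr", newEpoch));
    }
}
```

This is why the `transactional.id` must be stable across restarts of the same logical instance: it is the key that lets the coordinator recognize and fence the predecessor.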

A
Broker
Consumer Group
Coordinator
1 C P7
Broker
balances topic
($10) C
transfers topic
A->B
Broker
balances topic
$10 C
P0
P0
P1
Broker
'fund-tr'=>pid e0 P0 P1 P7 C
System with Successful Committed Transaction
Funds Transfer App
Consumer API
Producer API
Downstream App
Consumer API
Alice
Bob
transfer $10
Alice → Bob
1. Requests txn ID and
assigned PID and epoch
3. Notifies coordinator of
partition being written to
5. Bob’s account credited
6. Consumer offset committed
7. Notify coordinator that
transaction is complete
8. Coordinator writes commit
markers to p0, p1, p7
9. Downstream consumer with
read_committed processes
committed events
2
1
Alice, ($10)
4
3
Bob, $10
5
6
7
9
isolation.level=
'read_committed'
8

Consuming Transactions with read_committed
● Leader maintains last stable offset
(LSO), the smallest offset of any
open transaction
● Fetch response includes
○ only records up to LSO
○ metadata for skipping aborted
records
Broker, balances topic:
Offset  Producer  Record
57      pid 1     ($10)
58      pid 2     ($8)
60      pid 2     $8
61      pid 1     A
62      pid 1     ($7)
63      pid 2     C
64      pid 2     ($9)
65      pid 1     $7
66      pid 2     $9
67      pid 1     C
(A = abort marker, C = commit marker)
Diagram labels: LSO, HW, fetch response
Consumer able to read these records
Offset 57 discarded by consumer
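The two bullets above (records only up to the LSO, plus skipping aborted records) can be replayed on a small in-memory log. The sketch computes the LSO as the first offset of the earliest still-open transaction and then filters out control markers and aborted data. It is a simplified model: the log end offset stands in for the high watermark, the sample log mirrors the offsets on this slide, and `ReadCommittedSketch` and `Rec` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of read_committed visibility on one partition.
final class ReadCommittedSketch {
    static final class Rec {
        final long offset;
        final int pid;
        final String type; // "DATA", "ABORT", or "COMMIT"

        Rec(long offset, int pid, String type) {
            this.offset = offset;
            this.pid = pid;
            this.type = type;
        }
    }

    // LSO = first offset of the earliest open transaction, or the log end if none is open.
    static long lastStableOffset(List<Rec> log, long logEndOffset) {
        Map<Integer, Long> openTxnStart = new HashMap<>();
        for (Rec r : log) {
            if (r.type.equals("DATA")) {
                openTxnStart.putIfAbsent(r.pid, r.offset);
            } else {
                openTxnStart.remove(r.pid); // a marker closes that producer's transaction
            }
        }
        long lso = logEndOffset;
        for (long start : openTxnStart.values()) {
            lso = Math.min(lso, start);
        }
        return lso;
    }

    // Offsets a read_committed consumer would process: data below the LSO that was not aborted.
    static List<Long> visibleOffsets(List<Rec> log, long lso) {
        Set<Long> aborted = new HashSet<>();
        Map<Integer, List<Long>> pending = new HashMap<>();
        for (Rec r : log) {
            if (r.type.equals("DATA")) {
                pending.computeIfAbsent(r.pid, k -> new ArrayList<>()).add(r.offset);
            } else {
                List<Long> txn = pending.remove(r.pid);
                if (txn != null && r.type.equals("ABORT")) {
                    aborted.addAll(txn);
                }
            }
        }
        List<Long> visible = new ArrayList<>();
        for (Rec r : log) {
            if (r.type.equals("DATA") && r.offset < lso && !aborted.contains(r.offset)) {
                visible.add(r.offset);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Rec> log = List.of(
                new Rec(57, 1, "DATA"), new Rec(58, 2, "DATA"), new Rec(60, 2, "DATA"),
                new Rec(61, 1, "ABORT"), new Rec(62, 1, "DATA"), new Rec(63, 2, "COMMIT"),
                new Rec(64, 2, "DATA"), new Rec(65, 1, "DATA"), new Rec(66, 2, "DATA"),
                new Rec(67, 1, "COMMIT"));
        long lso = lastStableOffset(log, 68);
        System.out.println("LSO: " + lso);
        System.out.println("visible offsets: " + visibleOffsets(log, lso));
    }
}
```

In this model the open pid 2 transaction that starts at offset 64 holds back everything after it, even the already committed pid 1 record at offset 65, which is why long-running transactions delay read_committed consumers.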

A
Interacting with External Systems
Atomic writes to Kafka and
external systems are not
supported
● Instead, write the
transactional output to a
Kafka topic first
● Rely on idempotence to
propagate the data from
the output topic to the
external system
Broker
Consumer Group
Coordinator
1 C P7
Broker
balances topic
($10) C
transfers topic
A->B
Broker
balances topic
$10 C
P0
P0
P1
Broker
txn id=>pid e0 P0 P1 P7 C
Funds Transfer App
Consumer API
Producer API
Alice
Bob
transfer $10
Alice → Bob
External System
Kafka
Connect
'fund-tr'=>pid e0 P0 P1 P7 C
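One common way to make the hand-off to an external system idempotent is to remember the last applied offset per partition and ignore anything at or below it, so redelivered records have no effect. The sketch below applies that idea to an in-memory balance table. It is only one possible approach under that assumption, and `IdempotentSinkSketch` and its methods are hypothetical names rather than a Kafka Connect API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical external system that applies each (partition, offset) at most once.
final class IdempotentSinkSketch {
    private final Map<String, Long> balances = new HashMap<>();
    private final Map<Integer, Long> lastAppliedOffset = new HashMap<>();

    // Returns true if the record changed state, false if it was a redelivery.
    boolean apply(int partition, long offset, String account, long delta) {
        long last = lastAppliedOffset.getOrDefault(partition, -1L);
        if (offset <= last) {
            return false; // already applied; replaying it must not change the balance
        }
        balances.merge(account, delta, Long::sum);
        lastAppliedOffset.put(partition, offset);
        return true;
    }

    long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }

    public static void main(String[] args) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        sink.apply(0, 10L, "Alice", -10L);
        sink.apply(0, 10L, "Alice", -10L); // redelivered after a restart: ignored
        sink.apply(1, 4L, "Bob", 10L);
        System.out.println("Alice: " + sink.balance("Alice") + ", Bob: " + sink.balance("Bob"));
    }
}
```

For this to be safe in practice, the balance update and the stored offset have to be written to the external system atomically, for example in the same database transaction.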
