Kafka tiered-storage-meetup-2022-final-presented

Kafka Tiered Storage Using Azure Blobs
Towards running Kafka cost-effectively in Azure
Sumant Tambe
Sr. Software Engineer, Core Eng, LinkedIn
March 4, 2022

Agenda
1 Cost of Running Kafka in Azure
2 Tiered Storage
3 KIP-405 Kafka Tiered Storage Architecture
4 Operationalizing Tiered Storage Using Azure Blobs
5 Demo

Cost of Running Kafka In Azure
• To minimize migration and schedule risks, LinkedIn migrated Kafka to Azure as is*
• Kafka uses remote managed disks in LinkedIn's Azure deployments
• LinkedIn's common Kafka VM in Azure
• 64 core general-purpose VM
• RAID0 array of remote managed SSDs for throughput
• No RAID mirroring needed as managed disks are replicated
• Azure SKUs with local disks are more expensive than similar disk-less SKUs with managed disks attached to them
• Kafka is throughput-sensitive and not latency-sensitive
• The size of the VM local disk is also a limitation.
• Cluster size has a cap
*https://engineering.linkedin.com/blog/2019/building-next-infra

Cost of Running Kafka In Azure
• Standard and premium SSDs cost more than managed HDDs as
they offer better SLAs
• Better bandwidth and P99 latency guarantees
• For example, bursting support in Azure Standard and Premium SSDs
• The connectivity SLA to the VM improves with more expensive disks.
• Standard SSDs have transaction costs
• A large number of small writes tends to be more expensive than a small number of large writes
• Costs include VM, disk, transactions, and bursts
• Each read/write IO unit up to 256 KB is a billable IOP
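The billing arithmetic above can be sketched roughly as follows. This assumes an operation larger than one IO unit is counted as several units, rounded up, and it treats 256 KB as 256 × 1024 bytes. The helper name and the sample write sizes are illustrative and not from the talk; actual billing rules should be checked against current Azure pricing.

```python
import math

IO_UNIT_BYTES = 256 * 1024  # one billable IO unit (assumed to mean 256 KiB)


def billable_ios(write_sizes_bytes):
    """Count billable IO units for a sequence of write sizes (hypothetical helper).

    Assumption: each write costs at least one unit, and a write larger than
    one unit is charged as ceil(size / unit) units.
    """
    return sum(max(1, math.ceil(size / IO_UNIT_BYTES)) for size in write_sizes_bytes)


total_bytes = 64 * 1024 * 1024                            # 64 MiB of data to write
small = [4 * 1024] * (total_bytes // (4 * 1024))          # many 4 KiB writes
large = [IO_UNIT_BYTES] * (total_bytes // IO_UNIT_BYTES)  # few 256 KiB writes

print("small writes:", billable_ios(small))
print("large writes:", billable_ios(large))
```

Under these assumptions the same 64 MiB costs 64 times more billable IOs when written in 4 KiB pieces than in 256 KiB pieces, which is the point of the bullet about small versus large writes.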

What is Tiered Storage
Kafka storage in two persistent layers--local
disk for fresh data and remote store for older
data.
[Figure: storage layers labeled Remote Store, Local Disk, and Page Cache]

Kafka Tiered Storage Implementations In The Wild
• KIP-405
• Led by Uber
• HDFS and S3 (Uber)
• Azure Blob Storage (LinkedIn and Microsoft)
• possibly more...
• Confluent Tiered Storage
• Amazon S3, Google Cloud Storage, and Pure Storage FlashBlade (as of version 6.0.0)
• Not open-source

Why Use Kafka Tiered Storage?
Kafka storage in two persistent layers--local disk for fresh data and remote store for older data
Pros
• Use cheaper storage for older data
• Potentially increase retention at lower cost
• Reduce coupling between broker compute and storage
• Facilitate independent scaling of compute and storage
• Faster cluster maintenance operations. New brokers catch up to ISR much faster due to smaller local tier.
Cons
• Potentially higher latency for older data
• More complex due to additional system dependencies
• Compacted topics are not supported

KIP-405*: Kafka Tiered Storage Architecture
• Kafka Improvement Proposal—KIP-405
• Remote Log Manager
• (Pluggable) Remote Storage Manager
• Stores data and indexes
• (Pluggable) Remote Log Metadata Manager
• Stores startOffset, endOffset, etc. per segment
• Manages lifecycle of segments
• Kafka internals interact with
RemoteLogManager only
* Image Source: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
[Figure: KIP-405 architecture; producers and consumers are oblivious to tiering]

KIP-405: Kafka Tiered Storage Architecture
• RemoteStorageManager and
RemoteLogMetadataManager are separate
• This allows stronger consistency
guarantees for metadata than data.
• The cost of using an object store for
metadata may be too high in cloud
environment due to frequent
r/w access.

KIP-405 Topic-Partition Offsets Due to Tiering
• ListOffset API now offers
• EARLIEST, LATEST, and EARLIEST_LOCAL_TIMESTAMP
[Figure: a topic-partition log drawn from older to newer data, split into remote and local tiers, with markers for Earliest Offset, Earliest Local Offset, Remote Log End Offset, Last Stable Offset, and Log End Offset]
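A minimal sketch of how these offsets can be used to decide which tier serves a fetch. It assumes the broker reads from the local log whenever the requested offset is at or beyond the earliest local offset, and from the remote tier otherwise. The function name and the example offsets are hypothetical and are not Kafka APIs.

```python
def tier_for_offset(offset, earliest_offset, earliest_local_offset, log_end_offset):
    """Classify a fetch offset as 'remote', 'local', or 'out_of_range'.

    Assumption: earliest_offset <= earliest_local_offset <= log_end_offset,
    and data that is still on local disk is always served locally.
    """
    if offset < earliest_offset or offset > log_end_offset:
        return "out_of_range"
    if offset >= earliest_local_offset:
        return "local"
    return "remote"


# Example partition: offsets 0..99 exist only remotely, 100..199 are on local disk.
print(tier_for_offset(50, 0, 100, 200))
print(tier_for_offset(150, 0, 100, 200))
print(tier_for_offset(250, 0, 100, 200))
```

A consumer that starts from the earliest offset would therefore read from the remote tier first and switch to local reads once it passes the earliest local offset.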

KIP-405 Pluggable Remote Storage Manager (RSM)
This interface provides storing and fetching remote log segment data and indexes
• The API and the implementation operate at the segment level
• Segment details don’t leak outside Kafka
• API
• CopyLogSegmentData (copies data and index)
• FetchLogSegment
• FetchIndex
• DeleteLogSegment
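To make the segment-level shape of this API concrete, here is a small in-memory sketch in Python. It only mirrors the four operations listed above; the method signatures, the string segment ids, and the dict-based storage are assumptions made for illustration and are not the actual KIP-405 Java interface.

```python
class InMemoryRemoteStorage:
    """Toy stand-in for a remote store: keeps whole segments in a dict."""

    def __init__(self):
        self._segments = {}  # segment_id -> (data bytes, {index name: index bytes})

    def copy_log_segment_data(self, segment_id, data, indexes):
        # Data and indexes are copied together, as one unit per segment.
        self._segments[segment_id] = (bytes(data), dict(indexes))

    def fetch_log_segment(self, segment_id, start=0, end=None):
        # Return the byte range [start, end) of the segment data.
        data, _ = self._segments[segment_id]
        return data[start:end]

    def fetch_index(self, segment_id, index_name):
        _, indexes = self._segments[segment_id]
        return indexes[index_name]

    def delete_log_segment(self, segment_id):
        # Deleting a missing segment is a no-op, so the call can be retried.
        self._segments.pop(segment_id, None)

    def contains(self, segment_id):
        return segment_id in self._segments


rsm = InMemoryRemoteStorage()
rsm.copy_log_segment_data("seg-0", b"hello kafka", {"offset": b"\x00\x01"})
print(rsm.fetch_log_segment("seg-0", start=6))
rsm.delete_log_segment("seg-0")
print(rsm.contains("seg-0"))
```

A real plugin would replace the dict with calls to the object store, but callers would still only ever name whole segments, which is what keeps segment details from leaking outside Kafka.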

KIP-405 Pluggable Remote Log Metadata Manager (RLMM)
This interface provides storing and fetching
remote log segment metadata with strongly
consistent semantics.
• RemoteLogMetadataManager
• Create/Read/Update API for
RemoteLogSegmentMetadata
[Figure: segment states in sequence: Copy Segment Started, Copy Segment Finished, Delete Segment Started, Delete Segment Finished. The annotation "Can read the segment in this state only" applies to Copy Segment Finished.]
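The segment states can be modeled as a tiny state machine. The sketch below assumes the strictly linear order shown on the slide and that a segment is readable only once its copy has finished; a real implementation may permit additional transitions. All names are illustrative.

```python
from enum import Enum


class SegmentState(Enum):
    COPY_STARTED = "COPY_STARTED"
    COPY_FINISHED = "COPY_FINISHED"
    DELETE_STARTED = "DELETE_STARTED"
    DELETE_FINISHED = "DELETE_FINISHED"


# Linear order as drawn on the slide (assumption: no other transitions exist).
NEXT_STATE = {
    SegmentState.COPY_STARTED: SegmentState.COPY_FINISHED,
    SegmentState.COPY_FINISHED: SegmentState.DELETE_STARTED,
    SegmentState.DELETE_STARTED: SegmentState.DELETE_FINISHED,
}


def can_transition(current, new):
    """True if `new` is the single allowed successor of `current`."""
    return NEXT_STATE.get(current) is new


def is_readable(state):
    """A segment can be read only after the copy finished and before deletion starts."""
    return state is SegmentState.COPY_FINISHED


print(can_transition(SegmentState.COPY_STARTED, SegmentState.COPY_FINISHED))
print(is_readable(SegmentState.DELETE_STARTED))
```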

Basic Remote Log Segment Life-Cycle in the Leader
• ReplicaManager interacts with RemoteLogManager
• This happens when the leader copies log segments to the remote tier and cleans up expired remote log segments
• RemoteLogManager interacts with RemoteLogMetadataManager and RemoteStorageManager to change the state of the log segment
[Figure: timeline of one remote log segment across two lanes. Remote Log Metadata Manager: Add Remote Log Segment Metadata (COPY_STARTED), then Update Remote Log Segment Metadata for COPY_FINISHED, DELETE_STARTED, and DELETE_FINISHED. Remote Storage Manager: Copy Log Segment Data + index, then Fetch Log Segment + index (read 0-N times), then Delete Log Segment Data + index.]

Data and Metadata Movement in RSM and RLMM
1. Leader copies log segments with
the auxiliary state (includes time
index, offset index, and producer-id
snapshots) to the remote storage.
2. Leader publishes remote log
segment metadata about
beginning and completion of log
segment copy.
3. Follower fetches the messages from
the leader.
4. Follower consumes the required
remote log segment metadata

The TopicBasedRemoteLogMetadataManager (RLMM)
Must provide a strongly consistent view of the metadata
• Read-your-own-writes
• All nodes read the metadata updates in the same order (log append order)
The internal topic name is __remote_log_metadata
High Durability configs
• Replication factor = 4
• min.insync.replicas = 2 (tolerate one planned and one unplanned failure in the cluster)
• unclean.leader.election.enable=false
• Producer sends data with acks=all
• Currently not a compacted topic, so it grows indefinitely.
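The failure tolerance quoted above follows from simple arithmetic: with acks=all, a write succeeds as long as at least min.insync.replicas replicas are in sync, so the partition stays writable while up to (replication factor minus min.insync.replicas) replicas are down. A short sketch, using the values from the slide; the helper name is hypothetical.

```python
def tolerated_replica_failures(replication_factor, min_insync_replicas):
    """Number of replicas that can be down while acks=all writes still succeed."""
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas


# Values from the slide for the __remote_log_metadata topic.
replication_factor = 4
min_insync_replicas = 2

print(tolerated_replica_failures(replication_factor, min_insync_replicas))
```

With a replication factor of 4 and min.insync.replicas of 2, two replicas may be unavailable at once, which covers one planned and one unplanned failure.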

Work-In-Progress Remote Storage Manager for Azure Blob Storage

Possible Enhancements
RemoteLogMetadataManager and RemoteStorageManager
• Currently, stateless mapping of RemoteLogSegmentId to remote store
container/bucket
• The address space of the remote log segments is assumed to be well-known
(global) for ALL the segments
• However, if multiple storage accounts are needed for throughput or capacity reasons, there is currently no way to carry the "extra" metadata (e.g., location)
• Ideally the "extra" metadata can be opaque: byte[]
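The difference between the current stateless mapping and the proposed opaque metadata can be illustrated with a short sketch. Everything here is hypothetical: the blob naming layout, the storage account names, and the choice to encode the account name as UTF-8 bytes are assumptions for illustration, not what the talk or KIP-405 specifies.

```python
from typing import Optional, Tuple

DEFAULT_ACCOUNT = "kafkatier0"  # hypothetical storage account name


def blob_name(topic: str, partition: int, segment_id: str) -> str:
    """Stateless mapping: the blob name is derived from the segment identity alone."""
    return f"{topic}-{partition}/{segment_id}.log"


def resolve(topic: str, partition: int, segment_id: str,
            extra: Optional[bytes] = None) -> Tuple[str, str]:
    """Return (storage account, blob name) for a segment.

    `extra` is opaque to Kafka; only the storage plugin interprets it.
    In this sketch it simply carries the account name as UTF-8 bytes.
    """
    account = extra.decode("utf-8") if extra else DEFAULT_ACCOUNT
    return account, blob_name(topic, partition, segment_id)


# Without extra metadata every segment must live in one well-known location.
print(resolve("orders", 3, "abc123"))
# With opaque extra metadata a segment can point at a different account.
print(resolve("orders", 3, "abc123", extra=b"kafkatier7"))
```

Keeping the extra metadata as raw bytes means Kafka would only need to store and return it alongside the segment metadata, without understanding its contents.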

Operationalizing Kafka Tiered Storage with Azure Blob Storage
How to size brokers' local storage?
• % of data stored locally
• Depends on the maximum time the remote store may be unavailable
• (as of this writing) Azure guarantees 99.9% read and write availability for LRS, ZRS, and GRS accounts. That allows up to about 8 hr 45 min of downtime per year, which in the worst case could occur as a single incident.
• Ensure 9+ hours of local storage headroom at all times. In practice, this may be 18 hours or a day.
Determining the local retention time of the individual topics
• 0.5 hours is the current proposal
Monitoring Tiered Storage
• How much local storage is remaining, etc. using Cruise-Control
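The sizing reasoning above can be worked through as a small calculation. The 99.9% availability and the 9-hour headroom come from the slide; the per-broker ingress rate of 100 MB/s is a made-up number for illustration, and the helper names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_budget_hours(availability):
    """Yearly downtime allowed by an availability guarantee, in hours."""
    return (1 - availability) * HOURS_PER_YEAR


def local_storage_bytes(ingress_bytes_per_sec, headroom_hours):
    """Bytes a broker must be able to hold locally while the remote store is unavailable."""
    return ingress_bytes_per_sec * headroom_hours * 3600


budget = downtime_budget_hours(0.999)
print(f"downtime budget: {budget:.2f} hours per year")

# Hypothetical broker writing 100 MB/s to disk (replication traffic included).
needed = local_storage_bytes(100_000_000, 9)
print(f"local headroom for 9 hours: {needed / 1e12:.2f} TB")
```

The budget works out to roughly 8.76 hours, which is where the 9+ hours of headroom comes from; doubling the headroom to 18 hours doubles the required local capacity.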

Operationalizing Kafka Tiered Storage with Azure Blob Storage
Local segment size
• Large writes and reads to/from the remote store
• Read latency increases. Bootstrapping consumer FetchRequest may expire.
Remember, no cache as of this writing
• 100 MB suggested in Confluent's implementation of tiered storage. 256 MB at LinkedIn without tiering.
The retention.ms of the __remote_log_metadata topic should be longer than that of any other topic in the cluster
• Maybe use a compact retention policy?

A Quick Demo
1. Produce and Consume
2. Write segments to Azure Blob Storage
3. Fetch from Azure Blob Storage and resume locally
4. Browse Azure Blob Store contents
5. Peek __remote_log_metadata
