Apache Kafka - Martin Podval

Apache Kafka
@MartinPodval, hpsv.cz

What is Apache Kafka?
Messaging System
Distributed
Persistent and Replicable
Very fast - low latency - and scalable
Simple but highly configurable
Created at LinkedIn, open sourced under the Apache Software Foundation (apache.org)

Data Streaming
New kind of data ...
● User or application data (events) streams
● Monitoring - App, System
● App Logging
● High volume

Data Streaming Cont’d
… you want to process
● Using various components
● Into a target form
● Map, reduce, shuffle
● Real time or batch

HP Service Virtualization Use Cases
Processing of clients' message streams
Real-time performance modeling
Log aggregation

How To Solve It?
Producers and Consumers
● Distributed
● Decoupled
● Configurable
● Dynamic

Kafka Cluster
Brokers
● = Instances, Nodes
● Topics
● Partitions
● Replicas
ZooKeeper (ZK)
● Coordination

Kafka Topics
Commit Log
● Immutable
● Ordered
● Sequential Offset
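
The three properties above can be illustrated with a toy in-memory log. This is only a sketch of the idea, not Kafka's storage code; the class `ToyCommitLog` and its methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy commit log (hypothetical, in-memory): records are only ever appended,
// keep their append order, and are addressed by a sequential offset.
final class ToyCommitLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was assigned.
    long append(String message) {
        records.add(message);
        return records.size() - 1;
    }

    // Returns up to maxRecords records starting at offset, in append order.
    List<String> read(long offset, int maxRecords) {
        if (offset < 0 || offset > records.size() || maxRecords < 0) {
            throw new IllegalArgumentException("bad offset or maxRecords");
        }
        int to = (int) Math.min((long) records.size(), offset + maxRecords);
        return Collections.unmodifiableList(new ArrayList<>(records.subList((int) offset, to)));
    }

    // Offset that the next appended record will receive.
    long endOffset() {
        return records.size();
    }
}
```

Because nothing is ever updated in place, a reader only needs to remember one number (its offset) to resume or to re-read.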

Kafka Topics Cont’d
Partitioned
Independently:
● Stored
● Produced
● Consumed
⇒ Scalable
Replicated
● On partition basis
● Different brokers
⇒ Fault Tolerant
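
A rough sketch of "replicated on partition basis, on different brokers": each partition gets its own list of brokers. The placement rule below (simple round-robin with a shift) is an assumption chosen for illustration; Kafka's real assignment logic differs in details. `ReplicaPlacement` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified replica placement (illustrative only, not Kafka's algorithm).
final class ReplicaPlacement {
    // For each partition, returns the broker ids holding its replicas.
    // The first broker in each list is treated as the leader.
    static List<List<Integer>> assign(int partitions, int replicationFactor, int brokers) {
        if (brokers < 1 || replicationFactor < 1 || replicationFactor > brokers) {
            throw new IllegalArgumentException("need 1 <= replicationFactor <= brokers");
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add((p + r) % brokers); // consecutive brokers, so replicas never share a broker
            }
            result.add(replicas);
        }
        return result;
    }
}
```

With 3 partitions, replication factor 2 and 3 brokers, every broker leads one partition and follows another, so losing one broker leaves a copy of every partition.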

What Can I Do?
producer.write(topic_id, message);
consumer.read(topic_id, offset);

I Want To Produce
● java/scala client
● address of one or more brokers
● choose a topic where to produce
● highly configurable and tunable:
○ partitioner
○ number of acks (0 = async, no wait; 1 = leader only; -1 or all = every in-sync replica)
○ batching, buffer size, timeouts, retries, ...
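
A sketch of what such tuning can look like as plain `java.util.Properties`. The key names are those of the current Java producer client, which is an assumption here: the client version used for these slides may have named them differently. Host names and all values are hypothetical placeholders, and `ProducerSettings` is an invented class.

```java
import java.util.Properties;

// Builds producer settings only; no Kafka library is needed to run this sketch.
final class ProducerSettings {
    static Properties sketch() {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "broker1:9092,broker2:9092"); // hypothetical hosts
        p.setProperty("acks", "all");               // wait for all in-sync replicas
        p.setProperty("retries", "3");              // resend on transient failures
        p.setProperty("batch.size", "16384");       // bytes collected per partition batch
        p.setProperty("linger.ms", "5");            // wait briefly so batches can fill
        p.setProperty("buffer.memory", "33554432"); // total bytes buffered before send blocks
        p.setProperty("request.timeout.ms", "30000");
        return p;
    }
}
```

These properties would then be handed to the producer client's constructor; larger batches and a longer linger favour throughput, `acks` trades latency for durability.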

I Want To Consume
High Level API
● Groups abstraction
○ To All, To One
○ To Some
● Stream API
● Stores positions to support fault tolerance
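
One way to read the groups abstraction: every group receives the whole topic, while inside a group the partitions are split among its members. The sketch below splits partitions round-robin; that strategy and the name `GroupAssignment` are assumptions for illustration, not the client's actual assignor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative partition split inside ONE consumer group.
final class GroupAssignment {
    // Spreads partitions 0..partitions-1 over the group members, round-robin.
    static Map<String, List<Integer>> assign(List<String> members, int partitions) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("group has no members");
        }
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String m : members) {
            owned.put(m, new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(members.get(p % members.size())).add(p);
        }
        return owned;
    }
}
```

Under this reading, "To One" is all consumers sharing one group (each message handled by one of them), "To All" is every consumer in its own group (each sees everything), and "To Some" is a mix of the two.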

I Want To Consume Cont’d
Low Level
● Java/scala client
● Find the leader broker for a topic partition
● Calculate an offset
● Fetch messages
○ Re-consume if needed

I Want To Consume Cont’d
Delivery Semantics:
● At most once
● At least once
● Exactly once
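
The difference between the first two semantics comes down to whether the consumer stores its position before or after processing a message. The simulation below is a sketch with a single injected crash; `DeliveryDemo` and its parameters are hypothetical and no Kafka client is involved.

```java
import java.util.ArrayList;
import java.util.List;

final class DeliveryDemo {
    // Consumes the log with one crash injected while handling the record at crashAt.
    // The crash hits between the two steps (commit, process); afterwards the
    // consumer restarts from the last committed offset.
    // Returns every record that was actually processed, in order.
    static List<String> run(List<String> log, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        boolean crashed = false;
        int offset = 0;
        while (offset < log.size()) {
            boolean crashNow = !crashed && offset == crashAt;
            if (commitBeforeProcessing) {
                committed = offset + 1;           // at most once: position saved first
                if (crashNow) { crashed = true; offset = committed; continue; }
                processed.add(log.get(offset));
            } else {
                processed.add(log.get(offset));   // at least once: work done first
                if (crashNow) { crashed = true; offset = committed; continue; }
                committed = offset + 1;
            }
            offset++;
        }
        return processed;
    }
}
```

For the log a, b, c with a crash at b, committing first loses b, while committing last processes b twice. Exactly once generally needs extra help from the application, for example idempotent processing or storing the offset together with the result.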

Kafka Internals - Disks
Avoid:
● GC
● Random disk access

Kafka Internals - Disks Cont’d
Disks are fast ...
… when properly used
● sequential access - read ahead, write behind
● rely on operating system
○ avoid heap, materialization and GC
● it’s more like file copy over network
It’s easy … with immutable topics
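
"File copy over network" can be sketched with the JDK's `FileChannel.transferTo`, which lets the operating system move file bytes to a channel without staging them in a heap buffer where the platform supports it. The class `SegmentSender` and the idea of a "segment" file are assumptions for this example; it is not the broker's code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SegmentSender {
    // Streams bytes [position, position + count) of a log file to out
    // and returns how many bytes were actually transferred.
    static long send(Path segment, long position, long count, WritableByteChannel out)
            throws IOException {
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ)) {
            long end = Math.min(position + count, file.size());
            long sent = 0;
            while (position + sent < end) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = file.transferTo(position + sent, end - position - sent, out);
                if (n <= 0) {
                    break;
                }
                sent += n;
            }
            return sent;
        }
    }
}
```

Since stored messages never change, the bytes on disk can be sent as they are, which keeps them out of the heap and away from the garbage collector.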

Kafka Internals - Replication
“In Sync” Replicas
● Replication factor on partition basis
● One leader + 0..n follower replicas
● Replicas are consumers
○ “In Sync” if they are not “too far” behind a leader
○ Batch sync
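
A minimal sketch of the "not too far behind" rule, assuming the lag is measured in messages against a fixed threshold. The threshold `maxLag` and the class `InSyncReplicas` are hypothetical; the real broker's criterion is configurable and may be based on time rather than message count.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InSyncReplicas {
    // Returns the ids of followers whose lag behind the leader is at most maxLag messages.
    static Set<Integer> compute(long leaderEndOffset, Map<Integer, Long> followerOffsets, long maxLag) {
        Set<Integer> inSync = new TreeSet<>();
        for (Map.Entry<Integer, Long> e : followerOffsets.entrySet()) {
            long lag = leaderEndOffset - e.getValue();
            if (lag <= maxLag) {
                inSync.add(e.getKey());
            }
        }
        return inSync;
    }
}
```

A follower that falls out of the set simply keeps fetching like any other consumer and rejoins once it has caught up.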

Kafka Internals - Replication Cont’d
Tunable Trade-Offs
● Producer’s write method:
○ Not blocked, async
○ Waits for leader (master) ACK
○ Waits for all in-sync replicas
● Consumer pulls only committed messages
● Server’s minimum in-sync replicas
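
The last two bullets can be sketched as two small rules. The names below (`CommitRules`, `acceptsWrite`, `highWatermark`) are invented for this example and the logic is a simplification: a write that waits for all in-sync replicas is refused when too few replicas are in sync, and consumers only see offsets that every in-sync replica already holds.

```java
import java.util.Collection;
import java.util.Collections;

final class CommitRules {
    // acks: 0 = no wait, 1 = leader only, -1 = all in-sync replicas.
    // The minimum in-sync replica count only matters for acks = -1.
    static boolean acceptsWrite(int acks, int inSyncReplicas, int minInSyncReplicas) {
        return acks != -1 || inSyncReplicas >= minInSyncReplicas;
    }

    // Messages below this offset count as committed: it is the smallest
    // log end offset among the in-sync replicas (leader included).
    static long highWatermark(Collection<Long> inSyncLogEndOffsets) {
        return Collections.min(inSyncLogEndOffsets);
    }
}
```

Raising the minimum makes acknowledged writes safer but means producers are rejected sooner when replicas fall behind.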

Performance
“Incredible”
Scales with:
● clients count, message size
● number of replicas, partitions or topics
Depends on network and disk throughput

Performance Cont’d
Our testing
● 3 nodes, master + 2 replicas
● 500 000 msg/s (100-byte messages)
● 400 Mbit/s - 1.2 Gbit/s network throughput
● end2end latency 2-3 ms
@see http://bit.ly/1FsIR9a
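
A quick sanity check of these numbers, counting payload bytes only and ignoring protocol overhead. Reading the upper figure as "three copies of every message" (leader plus two replicas) is an assumption of this sketch, not something the slide states; `ThroughputCheck` is a hypothetical helper.

```java
final class ThroughputCheck {
    // Payload throughput in Mbit/s for a message rate, a message size and a number of copies.
    static double megabitsPerSecond(long messagesPerSecond, int bytesPerMessage, int copies) {
        return messagesPerSecond * (double) bytesPerMessage * 8 * copies / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.println(megabitsPerSecond(500_000, 100, 1)); // one copy of each message
        System.out.println(megabitsPerSecond(500_000, 100, 3)); // leader + 2 replicas
    }
}
```

500 000 msg/s at 100 bytes each is 50 MB/s, i.e. 400 Mbit/s for one copy and 1.2 Gbit/s for three, which lines up with the reported range.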

Ease of Use
● No installation, just run a java/scala program
● Streams in files & dirs
● Transparent ZooKeeper
● Ecosystem

Cons
● Beta version
● Dependency on ZooKeeper
● The way it is written in Scala
● No easy way to remove messages
