Apache Kafka
@MartinPodval, hpsv.cz
What is Apache Kafka?
Messaging System
Distributed
Persistent and Replicable
Very fast - low latency - and scalable
Simple but highly configurable
By LinkedIn, open-sourced under apache.org
Data Streaming
New kind of data ...
● User or application data (events) streams
● Monitoring - App, System
● App Logging
● High volume
Data Streaming Cont’d
… you want to process
● Using various components
● Into a target form
● Map, reduce, shuffle
● Real time or batch
HP Service Virtualization Use Cases
Processing of clients' message streams
Real-time performance modeling
Log aggregation
How To Solve It?
Producers and Consumers
● Distributed
● Decoupled
● Configurable
● Dynamic
Kafka Cluster
Brokers
● = Instances, Nodes
● Topics
● Partitions
● Replicas
ZooKeeper (ZK)
● Coordination
Kafka Topics
Commit Log
● Immutable
● Ordered
● Sequential Offset
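The three commit-log properties above can be sketched as a toy in-memory model. This is an illustration only, not Kafka code: the `CommitLog` class and its method names are made up for this example.

```python
class CommitLog:
    """Toy append-only log: records are only appended, never changed."""

    def __init__(self):
        self._records = []

    def append(self, message: bytes) -> int:
        """Append a message and return its sequential offset."""
        offset = len(self._records)
        self._records.append(bytes(message))  # store an immutable copy
        return offset

    def read(self, offset: int, max_messages: int = 10):
        """Return (offset, message) pairs in log order, starting at offset."""
        end = min(offset + max_messages, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = CommitLog()
first = log.append(b"user-login")
second = log.append(b"page-view")
replayed = log.read(first)  # reading does not remove anything
```

Because a read never removes data, any reader can replay the log from an earlier offset.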
Kafka Topics Cont’d
Partitioned
Independently:
● Stored
● Produced
● Consumed
⇒ Scalable
Replicated
● On partition basis
● Different brokers
⇒ Fault Tolerant
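A minimal sketch of both ideas, assuming a key-hash partitioner and a simple round-robin replica placement. The function names and broker names are hypothetical, and `zlib.crc32` is only a stand-in hash, not the one Kafka's clients use.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition; the same key always maps the same way."""
    return zlib.crc32(key) % num_partitions


def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers (round-robin)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + r) % len(brokers)]
            for r in range(replication_factor)
        ]
    return assignment


brokers = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
layout = assign_replicas(4, brokers, 2)
partition = choose_partition(b"user-42", 4)
```

Each partition ends up on two different brokers, so losing one broker leaves a copy of every partition available.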
What Can I Do?
producer.write(topic_id, message);
consumer.read(topic_id, offset);
I Want To Produce
● Java/Scala client
● address of one or more brokers
● choose a topic where to produce
● highly configurable and tunable:
○ partitioner
○ number of acks (0 = async, no wait; 1 = leader only; -1 = all in-sync replicas)
○ batching, buffer size, timeouts, retries, ...
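As a rough illustration, the tunables above map onto settings like the following. The keys are taken from the Java producer configuration of later Kafka releases (the older Scala producer used different names), while the broker addresses and all values are placeholders chosen for this sketch. The helper `to_properties` is hypothetical and only renders the settings as `.properties` text.

```python
producer_settings = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # hypothetical addresses
    "acks": "all",                  # wait for all in-sync replicas
    "retries": "3",                 # placeholder value
    "batch.size": "16384",          # bytes per batch, placeholder value
    "linger.ms": "5",               # wait up to 5 ms to fill a batch
    "buffer.memory": "33554432",    # total send buffer, placeholder value
    "request.timeout.ms": "30000",  # placeholder value
}


def to_properties(settings: dict) -> str:
    """Render settings as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


properties_text = to_properties(producer_settings)
```

A custom partitioner is configured separately by class name and is left out here.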
I Want To Consume
High Level API
● Groups abstraction
○ To All, To One
○ To Some
● Stream API
● Stores positions to support fault tolerance
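The group abstraction can be sketched as partition ownership: within one group each partition is read by exactly one consumer, while separate groups each see everything. The round-robin assignment below is a simplification for illustration, and all group and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread the partitions over the consumers of one group, round-robin."""
    owners = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owners[consumers[index % len(consumers)]].append(partition)
    return owners


partitions = [0, 1, 2, 3]

# "To One": consumers share a group, so each message is handled by one of them.
queue_style = {"workers": assign_partitions(partitions, ["c1", "c2"])}

# "To All": every consumer has its own group, so each one sees every partition.
broadcast_style = {
    "audit": assign_partitions(partitions, ["a1"]),
    "billing": assign_partitions(partitions, ["b1"]),
}
```

"To Some" is the mix of both: several groups, some of them with more than one consumer.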
I Want To Consume Cont’d
Low Level
● Java/Scala client
● Find the leader for a topic partition
● Calculate an offset
● Fetch messages
○ Re-consume if needed
I Want To Consume Cont’d
Delivery Semantics:
● At most once
● At least once
● Exactly once
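The difference between the first two semantics comes down to whether the consumer stores its position before or after processing a message. The simulation below is a sketch with made-up function names; it crashes the consumer once, between the two steps, and then restarts it from the stored position.

```python
def run(messages, committed, commit_first, crash_at=None):
    """Consume from the committed offset; optionally crash between the steps."""
    handled = []
    offset = committed
    while offset < len(messages):
        if commit_first:
            committed = offset + 1            # store position, then process
        else:
            handled.append(messages[offset])  # process, then store position
        if offset == crash_at:
            return handled, committed         # crash between the two steps
        if commit_first:
            handled.append(messages[offset])
        else:
            committed = offset + 1
        offset += 1
    return handled, committed


def run_with_restart(messages, commit_first, crash_at):
    """Run until the crash, then restart from the stored position."""
    before, committed = run(messages, 0, commit_first, crash_at)
    after, _ = run(messages, committed, commit_first)
    return before + after


messages = ["a", "b", "c"]
at_most_once = run_with_restart(messages, commit_first=True, crash_at=1)
at_least_once = run_with_restart(messages, commit_first=False, crash_at=1)
```

Committing first loses "b" after the crash; committing last processes "b" twice. Exactly once generally needs more than this, for example storing the position together with the processing result or making the processing idempotent.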
Kafka Internals - Disks
Avoid:
● GC
● Random disk access
Kafka Internals - Disks Cont’d
Disks are fast ...
… when properly used
● sequential access - read ahead, write behind
● rely on operating system
○ avoid heap, materialization and GC
● it’s more like file copy over network
It’s easy … with immutable topics
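The "file copy over network" idea can be illustrated with Python's `socket.sendfile`, which hands a byte range of a file to the operating system for sending where the platform supports it. This is an analogy for the technique, not Kafka's own code; the file contents, the helper name, and the local socket pair standing in for a broker and a consumer are all assumptions of this sketch.

```python
import os
import socket
import tempfile


def send_log_segment(path, sock, offset, count):
    """Send `count` bytes of the file, starting at `offset`, straight to a socket."""
    with open(path, "rb") as segment:
        return sock.sendfile(segment, offset=offset, count=count)


# A tiny stand-in for an on-disk log segment.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"msg-0|msg-1|msg-2|")
    segment_path = tmp.name

# A connected local socket pair plays the broker and the consumer.
broker_side, consumer_side = socket.socketpair()
sent = send_log_segment(segment_path, broker_side, offset=6, count=12)
broker_side.close()

received = b""
while True:
    chunk = consumer_side.recv(4096)
    if not chunk:
        break
    received += chunk
consumer_side.close()
os.remove(segment_path)
```

The sender never parses or rebuilds individual messages; it only names a byte range, which is possible because stored data is immutable.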
Kafka Internals - Replication
“In Sync” Replicas
● Replication factor on partition basis
● One leader + 0..n replicas
● Replicas are consumers
○ “In Sync” if they are not “too far” behind a leader
○ Batch sync
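A sketch of the "not too far behind" rule, assuming lag is measured in messages. The function name, replica names, and the `max_lag` threshold are hypothetical; real brokers apply their own configured thresholds.

```python
def in_sync_replicas(leader_end_offset, follower_offsets, max_lag):
    """Return the leader plus every follower within max_lag messages of it."""
    in_sync = ["leader"]
    for name, offset in sorted(follower_offsets.items()):
        if leader_end_offset - offset <= max_lag:
            in_sync.append(name)
    return in_sync


followers = {"replica-1": 998, "replica-2": 700}  # last offset each has fetched
isr = in_sync_replicas(1000, followers, max_lag=50)
```

Here `replica-2` is 300 messages behind and drops out of the in-sync set until it catches up.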
Kafka Internals - Replication Cont’d
Tunable Trade-Offs
● Producer’s write method:
○ Not blocked, async
○ Waits for master ACK
○ Waits for all in-sync replicas
● Consumer pulls only committed messages
● Server’s minimum in-sync replicas
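The last two bullets can be sketched as follows: a message counts as committed once every in-sync replica has it, and a write that asks for all in-sync replicas is refused when too few replicas are in sync. The function names and sample offsets are made up for this illustration.

```python
def high_watermark(isr_end_offsets):
    """Offsets below this value are stored on every in-sync replica."""
    return min(isr_end_offsets.values())


def visible_to_consumer(log, isr_end_offsets):
    """Consumers only get messages below the high watermark (committed)."""
    return log[:high_watermark(isr_end_offsets)]


def accept_write(acks, isr_size, min_insync_replicas):
    """With acks='all', refuse writes when too few replicas are in sync."""
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


log = ["m0", "m1", "m2", "m3"]
isr_end_offsets = {"leader": 4, "replica-1": 3}  # replica-1 lacks m3 so far
visible = visible_to_consumer(log, isr_end_offsets)
```

The trade-off is availability against durability: a higher minimum rejects more writes during failures but never acknowledges a message that lives on a single broker.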
Performance
“Incredible”
Scales with:
● clients count, message size
● number of replicas, partitions or topics
Depends on network and disk throughput
Performance Cont’d
Our testing
● 3 nodes, master + 2 replicas
● 500 000 msg/s (100-byte messages)
● 400 Mbit/s - 1.2 Gbit/s network throughput
● end2end latency 2-3 ms
@see http://bit.ly/1FsIR9a
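The throughput range is consistent with simple arithmetic on the message rate. The reading below is an assumption, not something the slides state: the lower bound is taken as the raw payload rate and the upper bound as that payload counted once per copy (leader plus two replicas).

```python
messages_per_second = 500_000
message_size_bytes = 100
copies = 3  # assumption: leader + 2 replicas each carry the payload once

payload_mbit = messages_per_second * message_size_bytes * 8 / 1_000_000
replicated_gbit = payload_mbit * copies / 1000
```

Under that assumption the payload alone is 400 Mbit/s and the three copies together add up to 1.2 Gbit/s.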
Ease of Use
● No installation, just run a Java/Scala program
● Streams in files & dirs
● Transparent ZooKeeper
● Ecosystem
Cons
● Beta version
● Dependency on ZooKeeper
● The way it is written in Scala
● No easy way to remove messages
Questions?

Apache Kafka - Martin Podval
