
Apache Kafka - Martin Podval


Slides about Apache Kafka presented at the Czech Java User Group.


  1. Apache Kafka @MartinPodval, hpsv.cz
  2. What is Apache Kafka? A messaging system:
     ● Distributed
     ● Persistent and replicable
     ● Very fast (low latency) and scalable
     ● Simple but highly configurable
     ● By LinkedIn, open sourced under apache.org
  3. Data Streaming A new kind of data ...
     ● User or application data (event) streams
     ● Monitoring (application, system)
     ● Application logging
     ● High volume
  4. Data Streaming Cont’d … that you want to process:
     ● Using various components
     ● Into a target form
     ● Map, reduce, shuffle
     ● In real time or in batches
  5. HP Service Virtualization Use Cases
     ● Processing client message streams
     ● Real-time performance modeling
     ● Log aggregation
  6. How To Solve It? Producers and Consumers:
     ● Distributed
     ● Decoupled
     ● Configurable
     ● Dynamic
  7. Kafka Cluster Brokers (= instances, nodes):
     ● Topics
     ● Partitions
     ● Replicas
     ZooKeeper (ZK):
     ● Coordination
  8. Kafka Topics A topic is a commit log:
     ● Immutable
     ● Ordered
     ● Sequential offsets
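A topic partition behaves like an append-only commit log: every write gets the next sequential offset, existing entries are never mutated, and a reader can fetch any offset it chooses. A minimal sketch of that model (the class name and methods are illustrative, not Kafka's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of one topic partition as an immutable, ordered commit log.
public class CommitLog {
    private final List<String> entries = new ArrayList<>();

    // Append a message; its offset is simply its position in the log.
    public long append(String message) {
        entries.add(message);
        return entries.size() - 1;   // sequential offset of the appended message
    }

    // Read the message stored at a given offset; the log itself never changes.
    public String read(long offset) {
        return entries.get((int) offset);
    }

    // Next offset to be assigned, i.e. the current log end.
    public long endOffset() {
        return entries.size();
    }
}
```

Because reads are addressed by offset rather than by a destructive "pop", many consumers can read the same log independently, each tracking its own position.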
  9. Kafka Topics Cont’d Partitioned: each partition is independently
     ● Stored
     ● Produced
     ● Consumed
     ⇒ Scalable
     Replicated:
     ● On a per-partition basis
     ● Across different brokers
     ⇒ Fault tolerant
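Scalability comes from spreading a topic across partitions. A common scheme for keyed messages, and roughly what Kafka's default partitioner does (Kafka actually uses a murmur2 hash), is hashing the key modulo the partition count, so the same key always lands in the same partition. A sketch with illustrative names:

```java
// Illustrative key-to-partition mapping: hash the key, mod the partition count.
public class Partitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit instead of Math.abs, which overflows for MIN_VALUE.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```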
  10. What Can I Do? producer.write(topic_id, message); consumer.read(topic_id, offset);
  11. I Want To Produce
     ● Java/Scala client
     ● Address of one or more brokers
     ● Choose a topic to produce to
     ● Highly configurable and tunable:
       ○ partitioner
       ○ number of acks (async=0, master=1, replicas=1+?)
       ○ batching, buffer size, timeouts, retries, ...
  12. I Want To Consume High-Level API:
     ● Group abstraction
       ○ To all, to one
       ○ To some
     ● Stream API
     ● Stores positions to support fault tolerance
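The group abstraction maps partitions to consumers: within one group, each partition is read by exactly one member ("to one"), while every separate group sees the full topic ("to all"). A round-robin assignment sketch, with illustrative names (Kafka has its own pluggable assignors):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative round-robin assignment of partitions to the consumers of one group.
public class GroupAssignor {
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            // Each partition goes to exactly one consumer in the group.
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }
}
```

Note the consequence: a group can use at most as many active consumers as the topic has partitions; extra members sit idle.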
  13. I Want To Consume Cont’d Low-Level API:
     ● Java/Scala client
     ● Find the leader for a topic
     ● Calculate an offset
     ● Fetch messages
       ○ Re-consume if needed
  14. I Want To Consume Cont’d Delivery semantics:
     ● At most once
     ● At least once
     ● Exactly once
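The three semantics differ in how a consumer orders "process the message" versus "commit the offset". At-least-once processes first and commits after, so a crash between the two replays the message; exactly-once processing on top of that requires deduplication. A sketch of consumer-side dedup by offset (all names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative consumer turning at-least-once delivery (messages may be
// redelivered after a crash) into exactly-once processing via offset dedup.
public class DedupConsumer {
    private final Set<Long> processed = new HashSet<>();
    private int processCount = 0;

    // Returns true if the message was processed, false if it was a duplicate.
    public boolean handle(long offset, String message) {
        if (!processed.add(offset)) {
            return false;          // already seen: redelivery after crash/retry
        }
        processCount++;            // side effect happens exactly once per offset
        return true;
    }

    public int processCount() { return processCount; }
}
```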
  15. Kafka Internals - Disks Avoid:
     ● GC
     ● Random disk access
  16. Kafka Internals - Disks Cont’d Disks are fast … when properly used:
     ● Sequential access (read ahead, write behind)
     ● Rely on the operating system
       ○ Avoid heap allocation, materialization and GC
     ● It’s more like a file copy over the network
     It’s easy … with immutable topics
  17. Kafka Internals - Replication “In-Sync” Replicas:
     ● Replication factor on a per-partition basis
     ● One leader + 0..n replicas
     ● Replicas are consumers
       ○ “In sync” if they are not “too far” behind the leader
       ○ Batch sync
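The "not too far behind" rule can be sketched as a lag threshold: a replica stays in the in-sync set while its log end is within some bound of the leader's, and a message counts as committed once every in-sync replica has it. A simplified model (the lag bound and names are illustrative; Kafka's actual criteria are time-based and configurable):

```java
import java.util.List;

// Illustrative in-sync-replica logic: a replica is "in sync" while its log end
// is at most maxLag messages behind the leader; a message is committed
// (visible to consumers) once every in-sync replica has it.
public class Isr {
    public static boolean inSync(long leaderEnd, long replicaEnd, long maxLag) {
        return leaderEnd - replicaEnd <= maxLag;
    }

    // Highest offset present on all in-sync replicas (the "high watermark").
    public static long committedOffset(long leaderEnd, List<Long> replicaEnds, long maxLag) {
        long committed = leaderEnd;
        for (long end : replicaEnds) {
            if (inSync(leaderEnd, end, maxLag)) {
                committed = Math.min(committed, end);
            }
        }
        return committed;
    }
}
```

Dropping laggards from the in-sync set is what keeps a slow replica from stalling commits for everyone else.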
  18. Kafka Internals - Replication Cont’d Tunable trade-offs:
     ● Producer’s write method:
       ○ Not blocked (async)
       ○ Waits for the leader’s ACK
       ○ Waits for all in-sync replicas
     ● Consumers pull only committed messages
     ● Server’s minimum in-sync replicas
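The producer-side choices above trade latency for durability: waiting for nobody is fastest but can lose data, waiting for the leader loses data only if the leader dies before replication, and waiting for the whole in-sync set survives any single broker failure. The mapping can be sketched as (an illustrative model, not Kafka's configuration API):

```java
// Illustrative mapping of producer ack modes to how many acknowledgements
// a write waits for, given the current in-sync replica (ISR) count.
public class AckPolicy {
    public enum Mode { NONE, LEADER, ALL_IN_SYNC }

    public static int requiredAcks(Mode mode, int inSyncReplicas) {
        switch (mode) {
            case NONE:        return 0;               // fire and forget, async
            case LEADER:      return 1;               // leader's ack only
            case ALL_IN_SYNC: return inSyncReplicas;  // wait for the whole ISR
            default: throw new IllegalArgumentException("unknown mode");
        }
    }
}
```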
  19. Performance “Incredible” Scales with:
     ● Client count and message size
     ● Number of replicas, partitions or topics
     Depends on network and disk throughput
  20. Performance Cont’d Our testing:
     ● 3 nodes, leader + 2 replicas
     ● 500,000 msg/s (100-byte messages)
     ● 400 Mbit/s - 1.2 Gbit/s network throughput
     ● End-to-end latency 2-3 ms
     @see http://bit.ly/1FsIR9a
  21. Ease of Use
     ● No installation, just run a Java/Scala program
     ● Streams stored in files & directories
     ● Transparent ZooKeeper
     ● Ecosystem
  22. Cons
     ● Beta version
     ● Dependency on ZooKeeper
     ● The way it is written in Scala
     ● No easy way to remove messages
  23. Questions?
