Kafka
Introduction to Apache Kafka
Kaunas JUG
Saulius Tvarijonas · saulius.tvarijonas@gmail.com
Introduction to Apache Kafka

Nowadays, IT systems have become complex. Do I need yet another messaging system in my already long tech stack? We will see how Kafka compares to traditional messaging systems, what is inside it, and how to use it in a Java stack.

Published in: Software
Introduction to Apache Kafka

  1. Kafka: Introduction to Apache Kafka
     Kaunas JUG
     Saulius Tvarijonas · saulius.tvarijonas@gmail.com
  2. Who is this person?
     Saulius Tvarijonas - saulius.tvarijonas@gmail.com
     ● VP of Software Development @ CUJO
     ● 20 years of Java experience
     ● Area of Interest
       ○ Distributed Systems
       ○ Big Data
       ○ Performance optimizations
       ○ DevOps
  3. What is CUJO?
     ● CUJO is a smart firewall that keeps your connected home safe
     ● Security
     ● Parental Control
     ● Big Data & Machine Learning based protection
     www.getcujo.com
  4. Agenda
     ● History
     ● Concepts
     ● Efficiency
     ● Development
     ● Operations
  5. History
     ● Originally developed by LinkedIn
     ● Open sourced in early 2011
     ● Its authors went on to found the company Confluent
     ● Latest version 0.10.1.1
     ● Scala + Java
  6. Positioning
     ● Distributed streaming platform
     ● Merge queue and publish-subscribe concepts
     ● Between JMS and log aggregation systems
  7. Concepts
     ● Kafka is run as a cluster on one or more servers
     ● The Kafka cluster stores streams of records in categories called topics
     ● Each topic is divided into one or more partitions
     ● Each record consists of a key, a value, and a timestamp
  8. Topic
     Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. Logs are rotated and deleted based on policy configuration.
  9. Producers & Consumers
     Offsets are controlled by the consumer. An offset is just a number (very cheap).
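
The two slides above can be illustrated with a toy model. This is a minimal sketch, not Kafka code: `PartitionLogSketch` and its methods are hypothetical names, the log lives in memory, and rotation and deletion are left out. It shows that a record's offset is simply its position in the append-only sequence, and that each consumer keeps its own offset as a plain number.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one partition: an append-only list where a record's offset is its index. */
final class PartitionLogSketch {
    private final List<String> records = new ArrayList<>();

    /** Appends a record to the end of the log and returns the offset assigned to it. */
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    /** Returns up to maxRecords records starting at fromOffset; reading never modifies the log. */
    List<String> read(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, Math.min(fromOffset, records.size()));
        int end = Math.min(start + maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("a");
        log.append("b");
        log.append("c");

        // Each consumer tracks its own position; the log does not know about them.
        long fastConsumer = 0;
        long slowConsumer = 0;
        fastConsumer += log.read(fastConsumer, 10).size();
        slowConsumer += log.read(slowConsumer, 1).size();
        System.out.println("fast=" + fastConsumer + " slow=" + slowConsumer);

        // Rewinding is just resetting the number.
        slowConsumer = 0;
        System.out.println(log.read(slowConsumer, 2));
    }
}
```

Because the position is held by the consumer, re-reading old records costs nothing more than choosing a smaller number.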
  10. Partition
      ● Allow the log to scale beyond a size that will fit on a single server
      ● Act as the unit of parallelism
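
How records are spread over partitions can be sketched as follows. This is an illustration under stated assumptions: `KeyPartitionerSketch` and `partitionFor` are hypothetical names, and `Arrays.hashCode` stands in for the murmur2 hash that Kafka's default partitioner applies to keyed records. The point is only that the same key always maps to the same partition, which is what keeps per-key ordering while different partitions are processed in parallel.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative key-to-partition mapping; the hash function is a stand-in, not Kafka's. */
final class KeyPartitionerSketch {

    /** Maps a key to a partition number in the range [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String key : new String[] {"device-1", "device-2", "device-3", "device-1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```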
  11. Replication
      ● Topics can (and should) be replicated
      ● Unit of replication is partition
      ● Each partition has 1 leader and 0 or more replicas
      ● ISR = In-Sync Replica
  12. Replication
  13. Durability Guarantees
      ● Options
        ○ Do not wait for ACK
        ○ Wait for ACK from leader
        ○ Wait for ACK from all ISRs
      ● Disable unclean leader election
      ● Specify a minimum ISR size
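
The options on this slide correspond to Kafka configuration keys. The sketch below only builds `java.util.Properties` objects and does not connect to a cluster; the class and method names are hypothetical, and the values are examples rather than recommendations. `acks` is a producer setting, while `min.insync.replicas` and `unclean.leader.election.enable` are set on the topic or broker side.

```java
import java.util.Properties;

/** Builds example settings for the durability options; a real producer also needs bootstrap servers and serializers. */
final class DurabilityConfigSketch {

    /** acks: "0" = do not wait, "1" = wait for the leader, "all" = wait for all in-sync replicas. */
    static Properties producerProps(String acks) {
        Properties props = new Properties();
        props.setProperty("acks", acks);
        return props;
    }

    /** Topic-level settings: minimum ISR size, and unclean leader election disabled. */
    static Properties topicProps(int minInSyncReplicas) {
        Properties props = new Properties();
        props.setProperty("min.insync.replicas", Integer.toString(minInSyncReplicas));
        props.setProperty("unclean.leader.election.enable", "false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps("all"));
        System.out.println(topicProps(2));
    }
}
```

With `acks` set to `all` and a minimum ISR size of 2, a write is acknowledged only once at least two replicas have it, so losing one broker does not lose acknowledged data.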
  14. Replica Management
      ● One broker elected as controller
      ● ZooKeeper used for metadata and coordination
      ● Rebalancing, rebalancing, rebalancing...
  15. Message Delivery Semantics
      ● At most once: messages may be lost but are never redelivered.
      ● At least once: messages are never lost but may be redelivered.
      ● Exactly once: this is what people actually want, each message is delivered once and only once.
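
On the consumer side, the difference between the first two semantics comes down to whether the offset is committed before or after the message is processed. The simulation below is a minimal sketch with hypothetical names and no Kafka dependency: it injects a single crash between the two steps and then restarts from the last committed offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates one consumer crash to contrast commit-before-process with commit-after-process. */
final class DeliverySemanticsSketch {

    /**
     * Consumes all messages, crashing once between the two steps for the message at crashAt.
     * Returns the messages whose processing completed, in order.
     */
    static List<String> consume(List<String> messages, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;      // last committed offset, survives the crash
        int position = 0;       // in-memory position, lost on crash
        boolean crashed = false;
        while (position < messages.size()) {
            String message = messages.get(position);
            // Step 1: either commit first or process first.
            if (commitBeforeProcessing) committed = position + 1; else processed.add(message);
            // Simulated crash between the two steps; the restart resumes at the committed offset.
            if (!crashed && position == crashAt) {
                crashed = true;
                position = committed;
                continue;
            }
            // Step 2: the remaining action.
            if (commitBeforeProcessing) processed.add(message); else committed = position + 1;
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("m0", "m1", "m2");
        System.out.println("at most once:  " + consume(messages, 1, true));
        System.out.println("at least once: " + consume(messages, 1, false));
    }
}
```

Committing first loses `m1`, while processing first handles `m1` twice. Getting exactly-once behaviour out of the second variant requires the processing step to be idempotent or to store its result together with the offset.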
  16. Log Compaction
  17. Streams
      ● Simple and lightweight client library
      ● Transparently handles the load balancing of multiple instances
      ● Fault-tolerant local state
      ● Time based window operations
      ● Map, Filter, Join, ...
  18. Serialization
      JSON, Avro, Protobuf, Thrift, XML?
  19. Compression
      ● Compression at topic level
      ● Compressed by producer in batches
      ● Codecs
        ○ Gzip
        ○ Snappy
        ○ Lz4
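
Why compressing in batches matters can be shown with the JDK's own gzip classes. This is an illustrative sketch, not the Kafka producer's code path: the class name, the sample messages, and the newline used to join the batch are all assumptions. It compares the total size of messages compressed one by one with the size of the same messages compressed as a single batch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Compares per-message gzip with gzip over a whole batch of similar messages. */
final class BatchCompressionSketch {

    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical sample data: small JSON-like messages that share most of their structure. */
    static List<String> sampleMessages(int count) {
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            messages.add("{\"user\":\"u" + i + "\",\"event\":\"click\"}");
        }
        return messages;
    }

    /** Sum of the compressed sizes when every message is gzipped on its own. */
    static int individuallyCompressedSize(List<String> messages) {
        int total = 0;
        for (String message : messages) {
            total += gzip(message.getBytes(StandardCharsets.UTF_8)).length;
        }
        return total;
    }

    /** Compressed size when all messages are joined and gzipped once. */
    static int batchCompressedSize(List<String> messages) {
        return gzip(String.join("\n", messages).getBytes(StandardCharsets.UTF_8)).length;
    }

    public static void main(String[] args) {
        List<String> messages = sampleMessages(100);
        System.out.println("one by one: " + individuallyCompressedSize(messages) + " bytes");
        System.out.println("as a batch: " + batchCompressedSize(messages) + " bytes");
    }
}
```

The batch compresses better because the repeated field names are shared across messages and the gzip header and trailer are paid once instead of per message.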
  20. Message Format
      1. 4 byte CRC32 of the message
      2. 1 byte "magic" identifier to allow format changes, value is 0 or 1
      3. 1 byte "attributes" identifier to allow annotations on the message independent of the version
         bit 0 ~ 2 : Compression codec
           0 : no compression
           1 : gzip
           2 : snappy
           3 : lz4
         bit 3 : Timestamp type
           0 : create time
           1 : log append time
         bit 4 ~ 7 : reserved
      4. (Optional) 8 byte timestamp only if "magic" identifier is greater than 0
      5. 4 byte key length, containing length K
      6. K byte key
      7. 4 byte payload length, containing length V
      8. V byte payload
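
The layout above can be written out with a `ByteBuffer`. This is a sketch of the listed fields for a message with magic value 1, not Kafka's own serializer: the class and method names are hypothetical, null keys are not handled, and it assumes that the CRC covers every byte after the CRC field and that integers are big-endian (the `ByteBuffer` default).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Encodes the fields listed on the slide for magic = 1 and decodes the attributes byte. */
final class MessageFormatSketch {
    static final int CODEC_MASK = 0x07;          // bits 0 ~ 2
    static final int TIMESTAMP_TYPE_MASK = 0x08; // bit 3

    static byte[] encode(byte attributes, long timestamp, byte[] key, byte[] payload) {
        // magic + attributes + timestamp + key length + key + payload length + payload
        int bodyLength = 1 + 1 + 8 + 4 + key.length + 4 + payload.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + bodyLength);
        buf.position(4);               // leave room for the CRC, filled in last
        buf.put((byte) 1);             // magic
        buf.put(attributes);
        buf.putLong(timestamp);
        buf.putInt(key.length);
        buf.put(key);
        buf.putInt(payload.length);
        buf.put(payload);

        CRC32 crc = new CRC32();
        crc.update(buf.array(), 4, bodyLength); // assumption: CRC over everything after the CRC field
        buf.putInt(0, (int) crc.getValue());
        return buf.array();
    }

    /** Compression codec: 0 none, 1 gzip, 2 snappy, 3 lz4. */
    static int compressionCodec(byte attributes) {
        return attributes & CODEC_MASK;
    }

    /** True when bit 3 marks the timestamp as log append time rather than create time. */
    static boolean isLogAppendTime(byte attributes) {
        return (attributes & TIMESTAMP_TYPE_MASK) != 0;
    }

    public static void main(String[] args) {
        byte attributes = 0x0A; // binary 1010: codec 2 (snappy), log append time
        byte[] message = encode(attributes, System.currentTimeMillis(),
                "k".getBytes(StandardCharsets.UTF_8), "value".getBytes(StandardCharsets.UTF_8));
        System.out.println("length=" + message.length
                + " codec=" + compressionCodec(attributes)
                + " logAppendTime=" + isLogAppendTime(attributes));
    }
}
```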
  21. Use Cases
      ● Messaging System (ActiveMQ, RabbitMQ)
      ● Storage System
      ● Stream Processing (Storm, Samza, Spark)
      ● Log Aggregation (Scribe, Flume)
      ● Metrics
  22. Efficiency
      ● Main reasons for high throughput and low latency
        ○ Batch of individual messages
        ○ Zero copy I/O using sendfile()
        ○ Heavily relies on Linux PageCache
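
In Java, the zero-copy path is reached through `FileChannel.transferTo`, which lets the operating system move bytes from a file to another channel without copying them through application buffers. The sketch below is an assumption-laden illustration: the class and file names are hypothetical, and a second temporary file stands in for the network socket a broker would write to, so whether `sendfile()` itself is used here depends on the platform and channel types.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Moves a file's bytes to another channel with FileChannel.transferTo. */
final class ZeroCopySketch {

    /** Transfers the whole source file to the target channel and returns the bytes sent. */
    static long transfer(Path source, WritableByteChannel target) {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (sent < size) {
                long n = in.transferTo(sent, size - sent, target);
                if (n <= 0) break;
                sent += n;
            }
            return sent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small "log segment" to a temp file and transfers it to a stand-in target. */
    static long demo() {
        try {
            Path segment = Files.createTempFile("segment", ".log");
            Path target = Files.createTempFile("socket-stand-in", ".bin");
            try {
                Files.write(segment, "hello kafka".getBytes(StandardCharsets.UTF_8));
                try (FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE)) {
                    return transfer(segment, out);
                }
            } finally {
                Files.deleteIfExists(segment);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes transferred: " + demo());
    }
}
```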
  23. Performance
  24. Development - API
      ● Producer API
      ● Consumer API
      ● Streams API
      ● Connect API
  25. Development
      ● Spring-kafka
      ● Spring Integration Kafka
      ● Spring Boot
  26. Operations
      ● Command line
      ● JMX monitoring
      ● Kafka Manager
      ● ZooKeeper cluster
  27. Mirroring data between clusters
      ● Provide a replica in another datacenter
      ● The mirror can have a different number of partitions
      ● Ordering by key is preserved, but offsets differ
      ● Do not use as a fault-tolerance mechanism
  28. Java Forever
      Thank You!
