Introduction to Kafka

This is a presentation introducing Kafka and some of its core concepts as presented at the Sydney ALT.NET meet-up on 26 April 2016.

  1. Introduction to Kafka BY DUCAS FRANCIS
  2. The problem [Diagram: Web, Mobile, API and Job services wired directly to a Security System, Real-time Monitoring, a Logging System and other services] It’s simple enough at first… Then it gets a little busy… And ends up a mess.
  3. The solution [Diagram: the same services decoupled through a pub/sub system: Producers, Brokers, Consumers] Decouple data pipelines using a pub/sub system.
  4. Apache Kafka A UNIFIED, HIGH-THROUGHPUT, LOW-LATENCY PLATFORM FOR HANDLING REAL-TIME DATA FEEDS
  5. A brief history lesson  Originally developed at LinkedIn in 2011  Graduated Apache Incubator in 2012  Engineers from LinkedIn formed Confluent in 2014  Up to version 0.9 with 0.10 on the horizon
  6. Motivation  Unified platform for all real-time data feeds  High throughput for high volume streams  Support periodic data loads from offline systems  Low latency for traditional messaging  Support partitioned, distributed, real-time processing  Guarantee fault-tolerance
  7. Common use cases  Messaging  Website activity tracking  Metrics  Log aggregation  Stream processing  Event sourcing  Commit log
  8. Benefits of Kafka  High throughput  Low latency  Load balancing  Fault tolerant  Guaranteed delivery  Secure
  9. Performance comparison
  10. Batch performance comparison
  11. Some terminology  Topic – feed of messages  Producer – publishes messages to a topic  Consumer – subscribes to topics and processes the feed of messages  Broker – server instance that participates in a cluster
  12. @apachekafka powers @microsot…
  13. Libraries  Python – kafka-python / pykafka  Go – sarama / go_kafka_client / …  C/C++ – librdkafka / libkafka / …  .NET – kafka-net (x2) / rdkafka-dotnet / CSharpClient-for-Kafka  Node.js – kafka-node / sutoiku/node-kafka / ...  HTTP – kafka-pixy / kafka-rest  etc.
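
To make the terminology concrete, here is a minimal sketch using kafka-python, one of the libraries listed above. The topic name and broker address are illustrative assumptions, not from the talk.

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publishes messages to a topic
    producer = KafkaProducer(bootstrap_servers='192.168.32.21:9092')
    producer.send('demo-topic', b'hello kafka')
    producer.flush()  # block until the broker has the message

    # Consumer: subscribes to the topic and processes the feed of messages
    consumer = KafkaConsumer('demo-topic',
                             bootstrap_servers='192.168.32.21:9092',
                             auto_offset_reset='earliest',
                             consumer_timeout_ms=5000)  # stop when idle
    for message in consumer:
        print(message.value)
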
  14. Architecture [Diagram: two producers send to a three-broker cluster coordinated by a Zookeeper cluster (x3); two consumers read from the brokers]
  15. Show me the Kafka!!! VAGRANT TO THE RESCUE
  16. Anatomy of a topic  Topics are broken into partitions  Messages are assigned a sequential ID called an offset  Data is retained for a configurable period of time  Number of partitions can be increased after creation, but not decreased  Partitions are assigned to brokers Each partition is an ordered, immutable sequence of messages that is continually appended to… a commit log.
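
A sketch of those concepts with kafka-python: every record a consumer receives carries its partition and offset, and a consumer can be pinned to an explicit partition. Topic and broker names are assumptions.

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers='192.168.32.21:9092',
                             consumer_timeout_ms=5000)

    # Assign an explicit partition and rewind to the start of its log
    partition = TopicPartition('demo-topic', 0)
    consumer.assign([partition])
    consumer.seek_to_beginning(partition)

    for record in consumer:
        # Each message knows where it lives: (topic, partition, offset)
        print(record.topic, record.partition, record.offset, record.value)
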
  17. Broker  Kafka service running as part of a cluster  Receives messages from producers and serves them to consumers  Coordinated using Zookeeper  Need an odd number for quorum  Stores messages on the file system  Replicates messages to/from other brokers  Answers metadata requests about brokers and topics/partitions  As of 0.9.0 – coordinates consumers
  18. Replication  Partitions on a topic should be replicated  Each partition has 1 leader and 0 or more followers  An In-Sync Replica (ISR) is one that’s communicating with Zookeeper and not too far behind the leader  Replication factor can be increased after creation, not decreased
  19. ./kafka-topics --create --replication-factor --partitions --describe
  20. Producers  Publish messages to a topic  Distribute messages across partitions  Round-robin  Key hashing  Send synchronously or asynchronously to the broker that is the leader for the partition  acks = 0 (none), 1 (leader), -1 (all ISRs)  Synchronous is obviously slower, but more durable
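
A kafka-python sketch of those producer options (names are illustrative): an unkeyed send is distributed across partitions, a keyed send always hashes to the same partition, and acks controls how many acknowledgements a send waits for.

    from kafka import KafkaProducer

    # acks=0: fire and forget; acks=1: leader only; acks='all': all ISRs
    producer = KafkaProducer(bootstrap_servers='192.168.32.21:9092', acks=1)

    # No key: messages are spread across partitions
    producer.send('demo-topic', b'unkeyed message')

    # With a key: the key hash picks the partition, so 'user-42'
    # always lands on the same partition
    future = producer.send('demo-topic', key=b'user-42', value=b'keyed')

    # Sends are asynchronous by default; block on the future to make
    # the send effectively synchronous
    metadata = future.get(timeout=10)
    print(metadata.partition, metadata.offset)
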
  21. Testing... Testing… 1 2 3 LET’S SEE HOW FAST WE CAN PUSH
  22. Consumers  Read messages from a topic  Multiple consumers can read from the same topic  Manage their offsets  Messages stay on Kafka after they are consumed
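
A sketch of manual offset management with kafka-python. Because consumers manage their own offsets, the same topic can be re-read from the start, and consuming a message deletes nothing. The topic, group and handler names are assumptions.

    from kafka import KafkaConsumer

    consumer = KafkaConsumer('demo-topic',
                             bootstrap_servers='192.168.32.21:9092',
                             group_id='demo-group',
                             enable_auto_commit=False,  # commit explicitly
                             consumer_timeout_ms=5000)

    for record in consumer:
        handle(record.value)  # hypothetical processing function
        consumer.commit()     # record our position; the message stays put
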
  23. Testing... Testing… 1 2 3 LET’S SEE HOW FAST WE CAN RECEIVE
  24. It’s fast! But why…?  Efficient protocol based on message sets  Batching messages to reduce network latency and small I/O operations  Append/chunk messages to increase consumer throughput  Optimised OS operations  pagecache  sendfile()  Broker serves consumers from cache where possible  End-to-end batch compression
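
Several of these optimisations are exposed as producer settings. A kafka-python sketch with illustrative values: linger_ms trades a little latency for fuller batches, and compression is applied to whole batches end to end.

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='192.168.32.21:9092',
        batch_size=32768,         # accumulate up to 32 KB per partition batch
        linger_ms=100,            # wait up to 100 ms for a batch to fill
        compression_type='gzip',  # batches are compressed end to end
    )

    for i in range(10000):
        producer.send('demo-topic', b'payload-%d' % i)
    producer.flush()
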
  25. Load balanced consumers  Distribute load across instances in a group by allocating partitions  Handle failure by rebalancing partitions to other instances  Commit their offsets to Kafka [Diagram: a two-broker cluster holding partitions P0 to P3; Consumer Group 1 contains C0 and C1, Consumer Group 2 contains C2, C3, C4 and C6]
  26. Consumer groups and offsets [Diagram: partition P3 as a log with offsets 0 to 10; consumers C0 and C1 each track their own read and committed offsets]
  27. Guarantees  Messages sent by a producer to a particular topic’s partition will be appended in the order they are sent  A consumer instance sees messages in the order they are stored in the log  For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log
  28. Ordered delivery  Messages are guaranteed to be delivered in order by partition, NOT topic [Diagram: P0 holds M1, M3, M5; P1 holds M2, M4, M6]  M1 before M3 before M5 – YES  M1 before M2 – NO  M2 before M4 before M6 – YES  M2 before M3 – NO
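
Because ordering holds per partition, a producer that needs ordered delivery for a given entity can key its messages so they all hash to one partition. A kafka-python sketch with assumed names:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='192.168.32.21:9092')

    # All events for account-7 share a key, hence a partition, hence
    # an order. Events for other accounts may land on other partitions
    # and interleave with these.
    for event in (b'created', b'updated', b'closed'):
        producer.send('accounts', key=b'account-7', value=event)
    producer.flush()
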
  29. Enough ALT… now .NET USING RDKAFKA-DOTNET
  30. FIN. THANK YOU
  31. Resources  http://kafka.apache.org/documentation.html  http://www.confluent.io/  https://kafka.apache.org/090/configuration.html  https://github.com/edenhill/librdkafka  https://github.com/ah-/rdkafka-dotnet
  32. Log compaction  Keep the most recent payload for a key  Use cases  Database change subscription  Event sourcing  Journaling for HA
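
A sketch of creating a compacted topic with kafka-python’s admin client (an API that postdates this 0.9-era talk; all names are illustrative). With cleanup.policy=compact, Kafka eventually retains only the latest value for each key.

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers='192.168.32.21:9092')
    admin.create_topics([
        NewTopic(name='account-snapshots',
                 num_partitions=3,
                 replication_factor=3,
                 topic_configs={'cleanup.policy': 'compact'}),
    ])
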
  33. Log compaction

Editor's Notes

  • High throughput – web activity tracking receiving tens of events per page hit or interaction.

    Periodic data loads – every 5 minutes, receiving 100,000s of messages

    Low latency – pub/sub in ms

    Distributed – anyone sending or receiving messages should be able to achieve HA
  • cd ~/Projects/kafka-vagrant

    vagrant status

    vagrant up

    vagrant ssh kafka-1

    cat /etc/kafka/server.properties

    https://kafka.apache.org/090/configuration.html
  • Topic – feed of messages
    Partition – topics are broken into partitions
    Messages – written to the end of a partition within a topic and assigned a sequential identifier (a 64-bit integer) called an offset

    Data is retained within a partition for a configurable amount of time. The time is defaulted in broker configuration, but can be set per topic. Messages are stored on the file system in segmented files.

    Number of partitions can be increased after creation, but not decreased. This is because (as mentioned) the messages are stored on the file system on a per-partition basis, so reducing partitions would be effectively deleting data.

    Partitions are assigned to brokers – not topics. Kafka attempts to balance the number of partitions across the available brokers, though this can also be configured manually. This is how Kafka load balances its activity: in theory, brokers holding an equal number of partitions should receive an equal number of send and fetch requests.


  • The responsibilities of coordination are split between ZK and Kafka. Older versions of Kafka relied more on ZK, but this is being brought more into the broker, and ZK is being used more for service discovery and configuration.

    Before 0.9.0, consumers were coordinated by ZK and needed a lot of logic to track which partitions were assigned to them. This was changed so that a broker is assigned as the consumer coordinator for each new consumer and tells consumers which partitions are assigned to them.
  • cd ~/Downloads/confluent-2.0.0/bin
    ls
    ./kafka-topics
    ./kafka-topics --create --topic perf-test --partitions 10 --replication-factor 3 --zookeeper 192.168.32.11:2181
    ./kafka-topics --list --zookeeper 192.168.32.12
    ./kafka-topics --describe --topic perf-test --zookeeper 192.168.32.13
  • ./kafka-producer-perf-test

    # Publish 10k x 4 KB messages
    ./kafka-producer-perf-test --topic perf-test --num-records 10000 --record-size 4096 --throughput 1000 --producer-props bootstrap.servers=192.168.32.21:9092,192.168.32.22:9092,192.168.32.23:9092

    # Up the throughput for 100k
    ./kafka-producer-perf-test --topic perf-test --num-records 100000 --record-size 4096 --throughput 100000 --producer-props bootstrap.servers=192.168.32.21:9092,192.168.32.22:9092,192.168.32.23:9092

    # No ACKs
    ./kafka-producer-perf-test --topic perf-test --num-records 1000000 --record-size 4096 --throughput 100000 --producer-props bootstrap.servers=192.168.32.21:9092,192.168.32.22:9092,192.168.32.23:9092 acks=0

    # ACKs from all ISRs
    ./kafka-producer-perf-test --topic perf-test --num-records 1000000 --record-size 4096 --throughput 100000 --producer-props bootstrap.servers=192.168.32.21:9092,192.168.32.22:9092,192.168.32.23:9092 acks=-1

    # ACK from leader, use snappy
    ./kafka-producer-perf-test --topic perf-test --num-records 1000000 --record-size 4096 --throughput 100000 --producer-props bootstrap.servers=192.168.32.21:9092,192.168.32.22:9092,192.168.32.23:9092 acks=1 compression.type=snappy

    # linger for 100ms
    ./kafka-producer-perf-test --topic perf-test --num-records 1000000 --record-size 4096 --throughput 100000 --producer-props bootstrap.servers=192.168.32.21:9092,192.168.32.22:9092,192.168.32.23:9092 acks=1 compression.type=snappy linger.ms=100
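
    A rough kafka-python equivalent of that last tuned run, as a sketch for readers without the Confluent CLI tools (snappy support needs the python-snappy package):

    import time
    from kafka import KafkaProducer

    # Mirror the run above: 4 KB records, leader-only acks, snappy, 100 ms linger
    producer = KafkaProducer(
        bootstrap_servers=['192.168.32.21:9092', '192.168.32.22:9092', '192.168.32.23:9092'],
        acks=1, compression_type='snappy', linger_ms=100)
    payload = b'x' * 4096
    start = time.time()
    for _ in range(100000):
        producer.send('perf-test', payload)
    producer.flush()
    print('%.0f records/sec' % (100000 / (time.time() - start)))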
  • # Consume 1M messages on 5 threads
    ./kafka-consumer-perf-test --zookeeper 192.168.32.11 --topic perf-test --messages 1000000 --group perf-test --threads 5
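
    A matching kafka-python consumer sketch (assumed names; it counts messages per second rather than using threads):

    import time
    from kafka import KafkaConsumer

    consumer = KafkaConsumer('perf-test',
                             bootstrap_servers='192.168.32.21:9092',
                             group_id='perf-test',
                             auto_offset_reset='earliest')
    count, start = 0, time.time()
    for record in consumer:
        count += 1
        if count == 1000000:
            break
    print('%.0f messages/sec' % (count / (time.time() - start)))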

  • Modern OSs maintain a page cache and aggressively use main memory for disk caching. By NOT utilizing this and instead keeping an in-memory representation of the data, you effectively double the amount of memory your application consumes. By utilizing it, you use all available RAM for caching without GC penalties, and the cache survives application restarts.

    This is obviously advantageous when reading messages, but also when writing.

    Rather than maintain as much as possible in-memory and flush it all out to the file system in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.


    Modern unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket – the sendfile system call.

    OS reads data from a file into pagecache in kernel space
    Application reads from kernel space to a user space buffer
    Application writes data back to kernel space into a socket buffer
    OS copies from socket buffer to NIC buffer to send over the network

    sendfile avoids this by instructing the OS to send data directly from the pagecache to the NIC. This means that consumers that are caught up will be served completely from memory.
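
    The zero-copy path can be sketched from Python, which exposes the same syscall; this illustrates the mechanism only (the Kafka broker itself runs on the JVM and uses FileChannel.transferTo):

    import os

    def serve_log_segment(path, conn):
        """Stream a file to a connected socket without copying it through
        user space: pagecache straight to the NIC via sendfile(2)."""
        size = os.path.getsize(path)
        with open(path, 'rb') as segment:
            offset = 0
            while offset < size:
                # os.sendfile returns the number of bytes actually sent
                sent = os.sendfile(conn.fileno(), segment.fileno(),
                                   offset, size - offset)
                if sent == 0:
                    break
                offset += sent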
  • Kafka scales topic consumption by distributing partitions among a consumer group, which is a set of consumers sharing a common group identifier.

    For each group a broker is selected as the group coordinator. The coordinator is responsible for managing the state of the group. Its main job is to mediate partition assignment when new members arrive, old members depart, and when topic metadata changes. The act of reassigning partitions is known as rebalancing the group.
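
    The effect is easy to see with kafka-python (a sketch with assumed names): start this script twice with the same group_id and the coordinator splits the topic's partitions between the two processes; kill one and its partitions are rebalanced to the survivor.

    from kafka import KafkaConsumer

    # Processes sharing a group_id form a consumer group; the group
    # coordinator assigns each member a share of the partitions.
    consumer = KafkaConsumer('perf-test',
                             bootstrap_servers='192.168.32.21:9092',
                             group_id='demo-group')

    for record in consumer:
        print('partition %d offset %d' % (record.partition, record.offset))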