Introduction to apache kafka

Introduction to Apache Kafka
2016.06.14
skerrien
samuel.kerrien@gmail.com

About Me
● 14 years Java Developer / Tech lead / ScrumMaster
● 3 years Data Engineer (Hadoop, Hive, Pig, Mahout)
skerrien
samuel.kerrien@gmail.com

Meetup Outline
● Purpose of Kafka
● Architecture
● Demo: Zookeeper
● More Kafka Internals
● Demos: writing Clients
● Discussions

If data is the lifeblood of high technology,
Apache Kafka is the circulatory system in use at LinkedIn.
-- Todd Palino
Source: https://engineering.linkedin.com/kafka/running-kafka-scale

How Kafka Came To Be At LinkedIn
Source: http://www.infoq.com/presentations/kafka-big-data [Neha Narkhede]

A Week @ LinkedIn
630,516,047 msg/days (avg per broker)
7,298 msg/sec (avg per broker)
Source: http://www.confluent.io/kafka-summit-2016-ops-some-kafkaesque-days-in-operations-at-linkedin-in-2015 [Joel Koshy]

Kafka Use Cases
Messaging
Web Site Activity Tracking
Metrics
Log Aggregation
Stream Processing
Source: https://kafka.apache.org/08/uses.html
● Requirements
○ Low throughput
○ Low latency
○ Durability
● Replacement for JMS
● Decoupling producer/consumer
● Kafka
○ Better throughput
○ Partitioning
○ Replication
○ Fault tolerance

Kafka Use Cases
Messaging
Metrics
Log Aggregation
Stream Processing
● Requirements
○ Very high volume
● Track user activity
○ Page view
○ Searches
○ Actions
● Goals
○ Real time processing
○ Monitoring
○ Load into Hadoop
■ Reporting

Kafka Use Cases
Messaging
Metrics
Log Aggregation
Stream Processing
● Requirements
○ Very high volumes
● Feed of operational data
○ VMs
○ Apps

Kafka Use Cases
Messaging
Metrics
Log Aggregation
Stream Processing
● Log file collection on servers
● Similar to Scribe or Flume
○ Equally goof performance
○ Stronger durability
○ Lower end-to-end ltency

Kafka Use Cases
Messaging
Metrics
Log Aggregation
Stream Processing
● Stage of processing = topic
● Companion framework
○ Storm
○ Spark
○ ...

How Other Companies Use Kafka
LinkedIn: activity streams, operational metrics
Yahoo: real time analytics (peak: 20Gbps compressed data), Kafka Manager
Twitter: storm stream processing
Netflix: real time monitoring, event processing pipelines
Spotify: log delivery system
Airbnb: event pipelines
. . .

Kafka Architecture
Brokers
Producer
Consumer Consumer Consumer Consumer
Producer Producer Producer
Topic
Topic
Topic
Topic
Topic
Topic Zookeeper
offset

Kafka Controller
One broker take the role of Controller which manages:
● Partition leaders
● State of partitions
● Partition reassignments
● Replicas

Kafka characteristics
Fast
Scalable
Durable
Distributed
● Single broker
○ Serve 100s MB/s
○ 1000s clients

Fast
Scalable
Durable
Distributed
● Cluster as data backbone
● Expanded elastically

Fast
Scalable
Durable
Distributed
● Messages persisted on disk
● TB per broker with no performance impact
● Configurable retention

Fast
Scalable
Durable
Distributed
● Cluster can server larger streams than a single
machine can

Anatomy Of A Message
MagicCRC Key Length Key Msg Length MsgAttributes
4 bytes 1 byte 1 byte 4 bytes K bytes 4 bytes M bytes

Zookeeper Architecture
ZK
Node
ZK
Node
ZK
Node
ZK
Node
ZK
Node
Zookeeper Service
Client Client Client Client Client Client
/

How Kafka Uses Zookeeper (1/2)
● Broker membership to a cluster
● Election of controller
● Topic configuration (#partitions, replica location, leader)
● Consumer offsets (alternative option since 0.8.2)
● Quotas (0.9.0)
● ACLs (0.9.0)

How Kafka Uses Zookeeper (2/2)
Kafka zNodes Structure
● /brokers/ids/[0...N] --> host:port (ephemeral node)
● /brokers/topics/[topic]/[0...N] --> nPartions (ephemeral node)
● /consumers/[group_id]/ids/[consumer_id] --> {"topic1": #streams, ..., "topicN": #streams} (ephemeral node)
● /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_counter_value (persistent node)
● /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] --> consumer_node_id (ephemeral node)

Zookeeper Demo
● List zNodes
● Create zNode
● Update zNode
● Delete zNode
● Ephemeral zNode
● Watches (data update & node deletion)

Kafka Demo 1
● Install Kafka / Zookeeper (brew install kafka -> Kafka 0.8.2.1 + ZK 3.4.6 )
● Start a single broker cluster
○ Create a topic
○ Create a producer (Shell)
○ Create a consumer (Shell)
● Start a multi-broker cluster
○ Create a topic (partitioned & replicated)
○ Run producer & consumer
○ Kill a broker / check topic status (leader, ISR)

Write Ahead Log / Commit Log
0 1 2 3 4 5 6 7 8 9 10 11 12 13

Write Ahead Log / Commit Log
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Consumer A
Consumer B

Why Commit Log ?
● Records what happened and when
● Databases
○ Record changes to data structures (physical or logical)
○ Used for replication
● Distributed Systems
○ Update ordering
○ State machine Replication principle
○ Last log timestamp defines its state
Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Topic Partitioning
Producer
Application
Consumer
Application
Partition 1
Topic
writes reads
Brokers
Brokers
Partition 2
Partition N
Brokers

Topic Replication
kafka-topics.sh --zookeeper $ZK --create --topic events --partitions 3 --replication-factor 2
P1 P2 P3
Topic
Logical
Broker 1
Broker 2
Broker 3
P1
P1
P2 P3
P2
P3
Physical

Kafka Guaranties
● Messages sent by a producer to a particular topic partition will be appended in the
order they are sent.
● A consumer instance sees messages in the order they are stored in the log.
● For a topic with replication factor N, Kafka will tolerate up to N-1 server failures
without losing any messages committed to a topic.

Durability Guarantees
Producer can configure acknowledgements:
● <= 0.8.2: request.required.acks
● >= 0.9.0: acks
Value Impact Durability
0 Producer doesn’t wait for leader weak
1 (default)
Producer waits for leader.
Leader sends ack when message written to log.
No wait for followers.
medium
all (0.9.0)
-1 (0.8.2)
Producer waits for leader.
Leader sends ack when all In-Sync-Replica have
acknowledged.
strong

Consumer Offset Management
● < 0.8.2: Zookeeper only
○ Zookeeper not meant for heavy write => scalability issues
● >= 0.8.2: Kafka Topic (__consumer_offset)
○ Configurable: offsets.storage=kafka
● Documentation show how to migrate offsets from Zookeeper to Kafka
http://kafka.apache.org/082/documentation.html#offsetmigration

Data Retention
3 ways to configure it:
● Time based
● Size based
● Log compaction based
Broker Configuration
log.retention.bytes={ -1|...}
log.retention.{ms,minutes,hours}=...
log.retention.check.interval.ms=...
log.cleanup.policy={delete|compact}
log.cleaner.enable={ false|true}
log.cleaner.threads=1
log.cleaner.io.max.bytes.per.second=Double.MaxValue
log.cleaner.backoff.ms=15000
log.cleaner.delete.retention.ms=1d
Topic Configuration
cleanup.policy=...
delete.retention.ms=...
...
Reconfiguring a Topic at Runtime
kafka-topics.sh --zookeeper localhost:2181
--alter --topic my-topic
--config max.message.
bytes=128000
kafka-topics.sh --zookeeper localhost:2181
--alter --topic my-topic
--deleteConfig max.message.bytes

Log Compaction (1/4)
$ kafka-topics.sh --zookeeper localhost:2181
--create
--topic employees
--replication-factor 1
--partitions 1
--config cleanup.policy=compact
$ echo '00158,{"name":"Jeff", "title":"Developer"}' | kafka-console-producer.sh
--broker-list localhost:9092
--topic employees
--property parse.key=true
--property key.separator=,
--new-producer
$ echo '00223,{"name":"Chris", "title":"Senior Developer"}' | kafka-console-producer.sh
--topic employees
--new-producer
Source: http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

$ echo '00158,{"name":"Jeff", "title":"Senior Developer"}' | kafka-console-producer.sh
--topic employees
--new-producer
$ kafka-console-consumer.sh --zookeeper localhost:2181
--topic employees
--from-beginning
--property print.key=true
00158,{"name":"Jeff", "title":"Developer"}
00223,{"name":"Chris", "title":"Senior Developer"}
00158,{"name":"Jeff", "title":"Senior Developer"}

$ kafka-topics.sh --zookeeper localhost:2181
--alter
--topic employees
--config segment.ms=30000
<... Wait 30 seconds ...>
--topic employees
--from-beginning
00158,{"name":"Jeff", "title":"Developer"}

$ echo '00301,{"name":"Dan", "title":"Project Manager"}' | kafka-console-producer.sh
--topic employees
--new-producer
--topic employees
--from-beginning
00301,{"name":"Dan", "title":"Project Manager"}
...
[2015-06-25 18:24:08,102] INFO [kafka-log-cleaner-thread-0],
Log cleaner thread 0 cleaned log employees-0 (dirty section =
[0, 3])
0.0 MB of log processed in 0.1 seconds (0.0 MB/sec).
Indexed 0.0 MB in 0.1 seconds (0.0 Mb/sec, 90.6% of total
time)
Buffer utilization: 0.0%
Cleaned 0.0 MB in 0.0 seconds (0.0 Mb/sec, 9.4% of total
time)
Start size: 0.0 MB (3 messages)
End size: 0.0 MB (2 messages)
31.0% size reduction (33.3% fewer messages)
(kafka.log.LogCleaner)

Kafka Performance - Theory
● Efficient Storage
○ Fast sequential write and read
○ Leverages OS page cache (i.e. RAM)
○ Avoid storing data twice in JVM and in OS cache (better perf on startup)
○ Caches 28-30GB data in 32GB machine
○ Zero copy I/O using IBM’s sendfile API (https://www.ibm.com/developerworks/library/j-zerocopy/)
● Batching of messages + compression
● Broker doesn’t hold client state
● Dependent on persistence guaranties request.required.acks

Kafka Benchmark (1/5)
● 0.8.1
● Setup
○ 6 Machines
■ Intel Xeon 2.5 GHz processor with six cores
■ Six 7200 RPM SATA drives (822 MB/sec of linear disk I/O)
■ 32GB of RAM
■ 1Gb Ethernet
○ 3 nodes for brokers + 3 for ZK and clients
Source: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Usecase Messages/sec
Throughput
(MB/sec)
1 producer thread, no replication
6 partitions, 50M messages, payload: 100 bytes
821,557 78.3
1 producer thread, 3x asynchronous replication (ack=1) 786,980 75.1
1 producer thread, 3x synchronous replication (ack=all) 421,823 40.2
3 producers, 3x asynchronous replication 2,024,032 193.0
Producer throughput:

Sustained producer throughput

Usecase Messages/sec
Throughput
(MB/sec)
1 Consumer 940,521 89.7
3 Consumer (1 per machine) 2,615,968 249.5
Consumer throughput from 6 partitions, 3x replication:

End-to-end latency:
● 2ms (median)
● 3ms (99th percentile)
● 14ms (99.9th percentile)

Kafka Demo 2 - Java High Level API
● Start a multi broker cluster + replicated topic
○ Write Java Partitioned Producer
○ Write Multi-Threaded Consumer
● Unit testing with Kafka

Kafka Versions
0.8.2 (2014.12)
● New Producer API
● Delete topic
● Scalable offset writes
0.9.x (2015.10)
● Security
○ Encryption
○ Kerberos
○ ACLs
● Quotas (client rate control)
● Kafka Connect
0.10.x (2016.03)
● Kafka Streams
● Rack Awareness
● More SASL features
● Timestamp in messages
● API to better manage Connectors

Starting with Kafka: Jay Kreps’ Recommendations
● Start with a single cluster
● Only a few non critical, limited usecases
● Pick a single data format for a given organisation
○ Avro
■ Good language support
■ One schema per topic (message validation, documentation...)
■ Supports schema evolution
■ Data embeds schema
■ Make Data Scientists job easier
■ Put some thoughts into field naming conventions
Source: http://www.confluent.io/blog/stream-data-platform-2/

Conclusions
● Easy to start with for a PoC
● Maybe not so easy to build a production system from scratch
● Must have serious monitoring in place (see Yahoo, Confluent, DataDog)
● Vibrant community, fast pace technology
● Videos of Kafka Summit are online: http://kafka-summit.org/sessions/
https://github.com/samuel-kerrien/kafka-demo

Introduction to apache kafka

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Introduction to apache kafka

Similar to Introduction to apache kafka (20)

Recently uploaded

Recently uploaded (20)

Introduction to apache kafka