2. Introduction
• Who am I?
– Ayyappadas Ravindran
– Staff SRE at LinkedIn
– Responsible for the Data Infra Streaming team
• What is this talk about?
– Kafka building blocks in detail
– Operating Kafka
– Data assurance with Kafka
– Kafka 0.9
3. Agenda
• Kafka – Reminder!
• Zookeeper
• Kafka Cluster – Brokers
• Kafka – Message
• Producers
• Schema Registry
• Consumers
• Data Assurance
• What is new in Kafka (Kafka 0.9)
• Q & A
4. Kafka Pub/Sub Basics – Reminder!
[Diagram: a producer and a consumer connected to a broker hosting topic A's partitions P0 and P1, with Zookeeper coordinating the cluster]
5. Zookeeper
• Distributed coordination service
• Also used for maintaining configuration
• Guarantees
– Order
– Atomicity
– Reliability
• Simple API
• Hierarchical Namespace
• Ephemeral Nodes
• Watches
6. Zookeeper in Kafka ecosystem
• Used to store metadata
– About brokers
– About topics & partitions
– About consumers / consumer groups
• Service coordination
– Controller election
– For administrative tasks
7. Zookeeper at Linkedin
• We are running Zookeeper 3.4
• Cluster of 5 (participants) + 1 (observer)
• Network and power redundancy
• Transaction logs on SSD.
• Lesson learned: do not overbuild your cluster
9. Kafka Message
• Distributed, partitioned, replicated commit log
• Messages
– Fixed-size header
– Variable-length payload (byte array)
– Payload can hold any serialized data
– LinkedIn uses Avro
• Commit logs
– Stored in sequence files under folders named after the topic
– Contain a sequence of log entries
10. Kafka Message - continued
• Logs
– Each log entry (message) has a 4-byte header followed by N bytes of message payload
– The offset is a 64-bit integer
– The offset gives the position of a message from the start of the stream
– On disk, log files are stored as segment files
– Segment files are named after the first message offset in that file, e.g. 00000000000.kafka
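Naming each segment file after its first offset makes offset lookup cheap: the broker can binary-search the sorted file names to find the segment holding any given offset. A small Python sketch of that idea (the start offsets below are made up for illustration):

```python
import bisect

# Hypothetical segment files, each named after the first offset it
# contains (on disk: e.g. 00000000000.kafka holds offsets from 0).
segment_start_offsets = [0, 1500, 3072, 4800]

def segment_for(offset):
    """Return the start offset of the segment file holding `offset`."""
    i = bisect.bisect_right(segment_start_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset precedes the earliest retained segment")
    return segment_start_offsets[i]

print(segment_for(0))      # the first segment
print(segment_for(2999))   # falls in the segment starting at 1500
print(segment_for(4800))   # exactly on a segment boundary
```

The same binary search also shows why retention by segment deletion is cheap: dropping the oldest segment just removes the smallest entry from the sorted list.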
11. Kafka Message - continued
• Writes to logs
– Appends go to the latest segment file
– The OS flushes messages to disk based on either the number of messages or time
• Reads from logs
– The consumer provides an offset & a chunk size
– Kafka returns an iterator to iterate over the message set
– On failure, consumers can restart consuming from either the start of the stream or the latest offset
12. Message Retention
• Kafka retains and expires messages via three options
– Time-based (the default, which keeps messages for at least 168 hours)
– Size-based (a configurable amount of data per partition)
– Key-based (one message is retained for each discrete key)
• Time- and size-based retention can work together, but not with key-based retention
– With both configured, messages are retained until either the size limit or the time limit is reached, whichever comes first
• Retention can be overridden per-topic
– Use the kafka-topics.sh CLI to set these configs
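Key-based retention (log compaction) keeps only the latest message per discrete key. A toy Python sketch of the idea (an illustration only, not the broker's actual background compaction of segment files):

```python
def compact(log):
    """Keep only the newest entry for each key, preserving log order.

    `log` is a list of (key, value) pairs, oldest first; a value of
    None models a delete marker (tombstone)."""
    latest = {}
    for key, value in log:
        latest[key] = value          # newer entries overwrite older ones
    # Tombstoned keys are eventually dropped entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user1", "a"), ("user2", "b"), ("user1", "c"), ("user2", None)]
print(compact(log))  # [('user1', 'c')]
```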
16. Kafka consumer
• Consumers are the processes that subscribe to a topic and process its feed
• High-level consumer
– Multi-threaded
– Manages offsets for you
• Simple consumer
– Greater control over consumption
– Must manage offsets itself
– Must find the broker hosting the leader partition
17. Kafka Consumer -- continued
• Important options to provide when consuming
– Zookeeper details
– Topic name
– Where to start consuming (from the beginning or from the tail)
– auto.offset.reset
– group.id
– auto.commit.enable (true)
• Console consumer
– Helps in debugging issues & can be used inside applications
– bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning
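For reference, the options above correspond to properties of the 0.8-era high-level consumer. A sketch of a typical configuration, written here as a Python dict purely for illustration (the keys are the property names from the slide; the values are made up):

```python
# Illustrative values only; the keys are the high-level consumer
# properties mentioned on the slide.
consumer_config = {
    "zookeeper.connect": "localhost:2181",  # Zookeeper details
    "group.id": "my-consumer-group",        # unique string per consumer group
    "auto.offset.reset": "smallest",        # "smallest" = beginning, "largest" = tail
    "auto.commit.enable": "true",           # commit offsets automatically
}
```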
19. Basic Kafka operations -- continued
• DO NOT DELETE TOPICS! Though you have the option to do so
• What happens when a broker dies?
– Leader failover
– Corrupted index / log files
– Under-replicated partitions (URPs)
– Uneven leader distribution
• Preferred replica election
– bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
– or auto.leader.rebalance.enable=true
21. Kafka operations – continued
• Expanding a Kafka cluster
– Create brokers with new, unique broker IDs
– Topics will not automatically move to the new brokers
– An admin needs to initiate the move
• Generate the plan: bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics-to-move.json --broker-list "5,6" --generate
• Execute the plan: bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --execute
• Verify the execution: bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --verify
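The --topics-to-move-json-file argument above expects a small JSON document naming the topics to relocate; --generate then emits the reassignment JSON that --execute and --verify consume. The input format looks roughly like this (the topic name is illustrative):

```json
{
  "version": 1,
  "topics": [
    {"topic": "mytopic"}
  ]
}
```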
22. Data Assurance
• No data loss and no reordering
– Critical for applications like DB replication
– Can Kafka do this? Yes!
• Causes of data loss on the producer side
– Setting block.on.buffer.full=false
– Retries being exhausted
– Sending messages without acks=all
• How can you fix it?
– Set block.on.buffer.full=true
– Set retries to Long.MAX_VALUE
– Set acks to all
– Resend from your callback function (producer.send(record, callback))
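Taken together, the producer-side fixes amount to a handful of configs plus a callback that re-sends on failure. A hedged Python sketch (the real client is the Java producer; the config keys mirror the slide, while fake_send is a stand-in used only to simulate one failed and one successful delivery):

```python
# The no-data-loss producer settings from the slide, as they would be
# passed to the (Java) producer; shown here only for illustration.
producer_config = {
    "acks": "all",                   # wait for all in-sync replicas
    "retries": 2**63 - 1,            # Long.MAX_VALUE: effectively retry forever
    "block.on.buffer.full": "true",  # block instead of dropping messages
}

def make_resend_callback(send):
    """Build a completion callback that re-sends the record on failure,
    mimicking the producer.send(record, callback) pattern."""
    def on_completion(record, exception):
        if exception is not None:
            send(record, on_completion)  # naive resend; bound this in real code
    return on_completion

# Tiny simulation: the first send "fails", the retry succeeds.
delivered, attempts = [], []
def fake_send(record, callback):
    attempts.append(record)
    if len(attempts) == 1:
        callback(record, RuntimeError("broker unavailable"))
    else:
        delivered.append(record)
        callback(record, None)

fake_send("msg-1", make_resend_callback(fake_send))
print(delivered)  # ['msg-1'] after one retry
```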
23. Data Assurance - Continued
• Causes of data loss on the consumer side
– Offsets are carelessly committed
– Data loss can happen if the consumer committed the offset but died while processing the message
• Fixing data loss on the consumer side
– Commit the offset only after processing of the message is complete
– Disable automatic offset commits (auto.commit.enable=false)
• Fixing on the broker side
– Have a replication factor >= 3
– Have min.insync.replicas = 2
– Disable unclean leader election
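The consumer-side rule (process first, commit second) can be sketched as a loop. This is a pure-Python simulation; a real consumer would call its client's commit API instead of the commit() stub here:

```python
class ManualOffsetConsumer:
    """Toy consumer that advances its committed offset only after a
    message has been fully processed (at-least-once delivery)."""

    def __init__(self, messages):
        self.messages = messages   # the partition's log
        self.committed = 0         # next offset to read after a restart

    def poll(self):
        return list(enumerate(self.messages))[self.committed:]

    def commit(self, offset):
        self.committed = offset + 1  # stand-in for the real commit call

processed = []
consumer = ManualOffsetConsumer(["m0", "m1", "m2"])
for offset, msg in consumer.poll():
    processed.append(msg)        # 1. process the message
    consumer.commit(offset)      # 2. only then commit its offset
print(consumer.committed)  # 3 (safe to resume here after a crash)
```

A crash between the two steps re-delivers the in-flight message on restart, which is the at-least-once trade-off the slide describes; committing before processing would instead risk losing it.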
24. Data Assurance - Continued
• Message reordering
– Can occur if more than one message is in transit
– and retries are enabled
• Fixing message reordering
– Set max.in.flight.requests.per.connection=1
25. Kafka 0.9 (Beta release)
• Security
– Kerberos- or TLS-based authentication
– Unix-like permissions to restrict who can access data
– Encryption on the wire via SSL
• Kafka Connect
– Supports large-scale real-time import and export for Kafka
– Takes care of fault tolerance, offset management and delivery management
– Will support connectors for Hadoop and databases
• User-defined quotas
– To manage abusive clients
– Rate-limit traffic on the producer side and the consumer side
26. Kafka 0.9 (Beta release)
– For example, allow only 10 MBps for reads and 5 MBps for writes
– If clients violate the quota, they are slowed down
– Can be overridden
• New Consumer
– Removes the distinction between the high-level consumer and the simple consumer
– Unified consumer API
– No longer Zookeeper-dependent
– Offers pluggable offset management
27. How Can You Get Involved?
• http://kafka.apache.org
• Join the mailing lists
– users@kafka.apache.org
• irc.freenode.net - #apache-kafka
28. Q & A
Want to contact us?
Akash Vacher (avacher@linkedin.com)
Ayyappadas Ravindran (appu@linkedin.com)
Talent Partner : Syed Hussain (sshussain@linkedin.com)
Mob : +91 953 581 8876
Editor's Notes
Kafka is a publish-subscribe messaging system, in which there are four components:
- Broker (what we call the Kafka server)
- Zookeeper (which serves as a data store for information about the cluster and consumers)
- Producer (sends data into the system)
- Consumer (reads data out of the system)
Data is organized into topics (here we show a topic named “A”) and topics are split into partitions (we have partitions 0 and 1 here).
A “message” is a discrete unit of data within Kafka. Producers create messages and send them into the system. The broker stores them, and any number of consumers can then read those messages.
In order to provide scalability, we have multiple brokers. By spreading out the partitions, we can handle more messages in any topic.
This also provides redundancy. We can now replicate partitions on separate brokers. When we do this, one broker is the designated “leader” for each partition. This is the only broker that producers and consumers connect to for that partition. The brokers that hold the replicas are designated “followers” and all they do with the partition is keep it in sync with the leader.
When a broker fails, one of the brokers holding an in-sync replica takes over as the leader for the partition. The producer and consumer clients have logic built-in to automatically rebalance and find the new leader when the cluster changes like this. When the original broker comes back online, it gets its replicas back in sync, and then it functions as the follower. It does not become the leader again until something else happens to the cluster (such as a manual change of leaders, or another broker going offline).
In the previous slides we have seen that Zookeeper is an integral part of the Kafka ecosystem.
So let's see what Zookeeper is: Zookeeper is a distributed coordination service for distributed applications.
Zookeeper is also used for configuration maintenance.
Zookeeper exposes simple APIs, using which applications can build higher-level coordination services.
Zookeeper guarantees ordering, atomicity and reliability.
Zookeeper is implemented over a shared hierarchical namespace, modeled on a shared Linux file system.
Every node in Zookeeper is called a znode.
A znode is similar to a file: it stores data, and it has an ACL and stat information.
Two important concepts in the Zookeeper ecosystem are ephemeral nodes & watches.
An ephemeral node exists as long as the session that created it exists.
Clients can set watches on a znode; the client is informed when the znode changes.
Now let's quickly see the coordination service in action: a leader election.
Consider that you have multiple clients competing to become the leader.
The challenge is how to elect a leader.
Zookeeper can be used for leader election.
Znodes are created with the SEQUENCE and EPHEMERAL flags set.
With the SEQUENCE flag, each znode is created with a monotonically increasing number appended to the end of its path.
The client which manages to create the znode with the lowest sequence number is elected leader.
The znode created is ephemeral, so it exists only as long as the leader exists.
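The recipe above boils down to: every candidate creates a sequential ephemeral znode, and the lowest sequence number wins. A minimal sketch of just the election step (pure Python; a real implementation would use a Zookeeper client library, and the znode names here are illustrative):

```python
def elect_leader(znodes):
    """Given sequential znode names like 'election/n_0000000042',
    return the one with the lowest sequence number; its creator
    becomes the leader."""
    def seq(znode):
        return int(znode.rsplit("_", 1)[1])
    return min(znodes, key=seq)

candidates = ["election/n_0000000003",
              "election/n_0000000001",
              "election/n_0000000002"]
print(elect_leader(candidates))  # election/n_0000000001
```

Because the winning znode is ephemeral, it disappears when the leader's session ends, and the remaining candidates re-run the same comparison to pick a successor.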
Now let's see how Zookeeper is used in the Kafka environment.
Kafka uses Zookeeper both for storing configuration information and for coordination (leader election & executing administrative tasks).
Zookeeper is used to store metadata about brokers, topics and consumers.
When a broker comes up, it registers itself with ZK: it creates a znode & stores the broker ID, hostname and endpoint details in it.
Two types of topic-related information are stored in ZK. The first is broker-related topic information: which broker hosts which topic/partition, replication information, and which replicas are leaders and which are followers.
The second type of topic-related information is configuration: per-topic settings like retention and clean-up policies.
Zookeeper also stores consumer information, like which consumers are consuming from which partitions and up to what point (in the log) a consumer has consumed, i.e. offset information.
Coming to the coordination service: one of the brokers in a Kafka cluster takes on the role of controller. The controller is responsible for managing the state of brokers, partitions and replicas, and also performs administrative tasks.
Controller election is done using zookeeper.
We run Zookeeper 3.4. It is used not just for Kafka but also for other critical applications at LinkedIn.
We have a cluster size of 5 + 1, where 5 are voting members and 1 is a non-voting member called an observer. The primary role of the observer is disaster recovery; it also helps with read scalability.
We make sure the nodes are in different racks to ensure power redundancy, and we use bond0 (balance-rr bonding), which provides load balancing and fault tolerance.
If your system is write-heavy it is good to have better disk performance; we use SSDs for the transaction logs. At a minimum, keep the transaction logs on a separate drive from the application logs and snapshots.
Do not overbuild your cluster: as the cluster size increases, the latency of ZK write transactions increases.
Alright, we have seen Zookeeper; now let's talk about brokers.
Brokers are the nodes which run the Kafka process.
Brokers store commit logs for topics/partitions.
Brokers register themselves with Zookeeper when they start.
Multiple brokers form a Kafka cluster.
Clusters are good because they provide redundancy (replicas) and fault tolerance.
The default replication factor we use is 2, so we can afford one node failure.
Clusters can be horizontally scaled: when you want to expand a cluster, you add more brokers.
You get better network usage and disk IO with multiple machines.
The controller is a broker with additional responsibility.
The controller is the brain of the cluster; it is a state machine.
We keep the state in Zookeeper; when there is a state change, the controller acts on it.
The controller manages the brokers, takes care of partitions and replication, and performs administrative tasks.
As said earlier, in Kafka a message is a discrete unit of data.
Messages are stored in commit logs.
Commit logs are distributed, partitioned and replicated.
A message contains a header and a payload.
The header contains information like the size of the payload, a CRC32 checksum, and the compression used (snappy or gzip).
Leaving the payload as a byte array gives a lot of flexibility.
At LinkedIn our messages are Avro-formatted.
Commit logs are stored in sequence files.
Sequence files are stored under a folder named after the topic-partition.
Sequence files contain log entries.
The header size is 4 bytes.
The payload can be of variable size; at LinkedIn we cap it at 1 MB.
Messages in the commit log are identified by an offset number.
An offset is a 64-bit number; it represents the position of the message from the start of the stream, i.e. the start of that topic partition.
Segments are named after the first offset in the segment.
Writes happen at the tail end of the latest segment.
Messages are written to the OS page cache and flushed to disk based on either the number of messages or a period of time.
When reading from the log, consumers provide an offset number and a chunk size.
Kafka returns an iterator, which contains a message set.
Ideally the chunk size will cover multiple messages.
There can be a corner case in which a message is larger than the chunk size provided; in that case, the consumer doubles its chunk size and retries.
On consumer failure (here meaning the consumer tried to fetch an offset which doesn't exist), the consumer can either fail or reset its offset to the start or the current end of the stream.
The commit logs keep accumulating data; at some point this is going to fill your disk, so you need a retention policy to rotate and purge the logs.
Kafka provides two clean-up policies: you can either rotate the logs or compact them.
Rotation can happen based on time or size.
Log compaction is interesting: here we don't purge an entire segment of the log; instead we remove entries having the same key and retain just the latest one. Compaction can only happen with semantic partitioning.
You can have a per-topic retention policy; Kafka ships a CLI which can be used to set this value.
An application which writes data into a Kafka topic is called a producer.
Producer code needs to be given the details of a broker from which it can fetch metadata. The metadata contains information about the brokers, including the broker ID where the leader partition for the topic resides.
The serialization class is pluggable; you can specify an encoder class. At LinkedIn we use Avro serialization.
The partitioner class specifies how messages should be partitioned, i.e. to which partition a message should be written.
request.required.acks specifies whether the producer needs to wait for an ack from the broker. It has 3 values: 0 = don't wait, 1 = wait for an ack from at least the leader, -1 (or all) = wait for acks from all followers as well.
When a producer sends messages to a broker, you can ask it to batch multiple messages and send them in one go. This way you can compress the messages, and there is less overhead in terms of creating connections to the broker.
Different types of compression are supported, like gzip, snappy and lz4.
Sticky partitioning is specific to LinkedIn; it makes sure we send messages to only one partition for a given period of time. This way we can reduce the connection count.
So we talked about messages.
At LinkedIn we send messages in Avro format.
An Avro message contains the schema of the data plus the data itself; the data is stored in a serialized binary format.
LinkedIn's custom producer adds extra information to each message for tracking and auditing purposes.
To save on storage and network, the schema is stripped from the message and stored in a centralized location; the message carries a schema ID used to retrieve the schema.
When a consumer wants to read the data, it retrieves the schema from the schema registry and then reads the message.
Schemas are cached locally so as to reduce load on the schema registry.
We don't want to break existing consumers, so old backward-compatible schemas are also stored in the schema registry.
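The flow described here (strip the schema, prepend a schema ID, look the schema up on read) can be sketched with a fixed-size binary header. The 4-byte ID and the in-memory registry below are illustrative assumptions, not LinkedIn's actual wire format:

```python
import struct

# Stand-in for the schema registry service; in practice this mapping
# is fetched over the network and cached locally.
schema_registry = {1: '{"type": "string"}'}

def encode(schema_id, payload):
    """Prepend a 4-byte big-endian schema ID to the serialized payload."""
    return struct.pack(">I", schema_id) + payload

def decode(message):
    """Split off the schema ID, look up the schema, return both parts."""
    (schema_id,) = struct.unpack(">I", message[:4])
    schema = schema_registry[schema_id]   # registry lookup (or cache hit)
    return schema, message[4:]

msg = encode(1, b"hello")
schema, payload = decode(msg)
print(schema, payload)
```

Storing old backward-compatible schemas under their own IDs is what lets a consumer decode messages produced before a schema change.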
Consumers are the processes responsible for consuming from topics. They subscribe to a topic, consume the messages and process them.
Kafka offers a consumer abstraction called the 'consumer group'. A consumer group has one or more consumer instances; multiple consumer instances label themselves with a consumer group.
Traditionally consumers work either in queue mode or pub-sub mode. In queue mode each message is sent to one consumer instance; in pub-sub mode, a message is sent to all instances.
Messaging systems guarantee ordering, but when messages are delivered to multiple consumers asynchronously they may not be received in order. The workaround is to use a single consumer, but then there is no parallel consumption. Kafka solves this via partitions: for an N-partition topic you can have N consumers, with each consumer consuming from one partition. This guarantees ordering per partition, and with multiple partitions you get parallelism.
High-level consumers are multithreaded and manage offsets for you. They have consumer groups, and rebalance when a consumer instance joins or leaves a group.
The simple consumer gives you greater control: you can read a subset of partitions, and messages can be read repeatedly. The drawback is that you need to deal with offsets and find the leader partitions yourself.
Consumers keep their offset information in Zookeeper. This marks the point up to which a particular consumer has consumed; if that consumer thread dies, it knows exactly where to resume.
Obviously you need to tell consumers which topic to consume from.
You have the option to start consuming at a particular offset, or from the start or end of the stream.
auto.offset.reset: remember that consumers store offsets in Zookeeper. If the consumer was not able to reach Zookeeper, or provided an offset which doesn't exist, what should the default behavior be: consume from the tail end or from the beginning? This is controlled by auto.offset.reset.
The consumer group is an abstraction that helps consumer instances consume messages either in queueing fashion or in pub-sub mode. group.id is used to set the consumer group name; it takes a string.
auto.commit.enable: consumers store the offset up to which they have consumed. Enabling this makes the consumer commit offsets automatically.
group.id is a string that represents the consumer group; it should be unique.
Kafka ships command-line tools to manage Kafka clusters. These tools are used for maintenance and debugging; we will go through them quickly.
Now let's see the operational challenges when a broker dies.
Kafka does the leader failover, so one of the followers in the ISR becomes the new leader.
You will end up having corrupt index/log files.
You will end up with under-replicated partitions; obviously, since you lost one of the replicas.
Kafka takes care of the corrupt index/log files: it discards the incomplete log entry.
URPs are fixed when the broker comes back up. The point to remember is that the replicas come back as followers.
This creates a challenge: now you have an uneven leader distribution across your cluster.
Kafka ships a CLI with which you can rebalance the leader distribution.
There is also an option to do this automatically, but it's not very clean.
- Partition reassignment
- A broker-leveling script moves data to even out the data volume per broker
As I mentioned, when you add a broker to a cluster it won't be used by existing partitions.
With the 0.8.1 release of Kafka there is a new feature: partition reassignment! Now when you add a broker to the cluster, it can be used by your existing topics and partitions. Existing partitions can be moved around live, completely transparently to all consumers and producers. We have developed a tool that sits on top of the partition reassignment tool and balances a cluster after you add new brokers, or if your cluster is simply unbalanced (there are many ways you can wind up in this state). It goes out to each broker and figures out how big each partition is (on disk) and the total amount of storage used on each broker. Next it starts calling the partition reassignment tool to make the larger brokers smaller and the smaller brokers larger. It stops once the overall data size is within 1 GB between the smallest and largest brokers. This is just one example of the many ways to optimize a cluster with the partition reassignment tool.
Expanding a Kafka cluster is very simple: create brokers with unique broker IDs and start the Kafka server; the server will be automatically added to the cluster.
But adding new brokers won't trigger automatic balancing; an admin needs to move topics to the new brokers. Only the initiation is manual; the process itself is automated.
Data loss is not a big deal in applications like pageview event tracking; losing one or two messages in a million is fine.
It becomes critical for applications like DB replication and transactions which involve money.
Where can loss happen? On the producer end, the consumer side and the broker side.
Let's see the issues on each side.
Causes of data loss on the producer end:
Setting block.on.buffer.full to false will throw an error and discard the messages.
Another cause can be the number of retries being exhausted.
It also matters whether you are running in async or sync mode, and in sync mode, whether you are waiting for a commit message from all replicas or not.
When block.on.buffer.full is set to true, the producer won't take any more messages when its buffer is full.
If you set retries to Long.MAX_VALUE, it will retry up to 2^63 - 1 times.
Set acks to all.
The cause of data loss on the consumer end is that you are careless! Just kidding.
This can happen if you consume messages and commit the offset before actually processing them; you can then fail during processing.
How do you fix this?
One: commit the offset only after processing the message.
Two: disable automatic offset commits.
Data needs to be moved in and out of Kafka and other systems.
People use multiple solutions.
So how can you get more involved in the Kafka community?
The most obvious answer is to go to kafka.apache.org. From there you can:
Join the mailing lists, either on the development or the user side.
You can also dive into the source repository, and work on and contribute your own tools back.
Kafka may be young, but it’s a critical piece of data infrastructure for many of us.