2. Introduction
• Who am I?
– Ayyappadas Ravindran
– Staff SRE at LinkedIn
– Responsible for the Data Infra Streaming team
• What is this talk about?
– Kafka building blocks in detail
– Operating Kafka
– Data assurance with Kafka
– Kafka 0.9
3. Agenda
• Kafka – Reminder!
• Zookeeper
• Kafka Cluster – Brokers
• Kafka – Message
• Producers
• Schema Registry
• Consumers
• Data Assurance
• What is new in Kafka (Kafka 0.9)
• Q & A
4. Kafka Pub/Sub Basics – Reminder!
[Diagram: a producer and a consumer connected to a broker hosting topic A's partitions P0 and P1, with Zookeeper coordinating the cluster]
5. Zookeeper
• Distributed coordination service
• Also used for maintaining configuration
• Guarantees
– Order
– Atomicity
– Reliability
• Simple API
• Hierarchical Namespace
• Ephemeral Nodes
• Watches
6. Zookeeper in Kafka ecosystem
• Used to store metadata
– About brokers
– About topics & partitions
– About consumers / consumer groups
• Service coordination
– Controller election
– For administrative tasks
7. Zookeeper at Linkedin
• We are running Zookeeper 3.4
• Cluster of 5 (participants) + 1 (observer)
• Network and power redundancy
• Transaction logs on SSD.
• Lesson learned: do not overbuild your cluster
9. Kafka Message
• Distributed, partitioned, replicated commit log
• Messages
– Fixed-size header
– Variable-length payload (byte array)
– Payload can hold any serialized data
– LinkedIn uses Avro
• Commit logs
– Stored in sequence files under folders named after the topic
– Contain a sequence of log entries
10. Kafka Message - continued
• Logs
– Each log entry (message) has a 4-byte header followed by N bytes of message payload
– The offset is a 64-bit integer
– The offset gives the position of a message from the start of the stream
– On disk, log files are stored as segment files
– Segment files are named after the first message offset in that file, e.g. 00000000000.kafka
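Naming each segment file after its first offset makes offset lookup cheap: the broker can binary-search the sorted file names to find the segment holding any given offset. A small Python sketch of that idea (the start offsets below are made up for illustration):

```python
import bisect

# Hypothetical segment files, each named after the first offset it
# contains (on disk: e.g. 00000000000.kafka holds offsets from 0).
segment_start_offsets = [0, 1500, 3072, 4800]

def segment_for(offset):
    """Return the start offset of the segment file holding `offset`."""
    i = bisect.bisect_right(segment_start_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset precedes the earliest retained segment")
    return segment_start_offsets[i]

print(segment_for(0))      # the first segment
print(segment_for(2999))   # falls in the segment starting at 1500
print(segment_for(4800))   # exactly on a segment boundary
```

The same binary search also shows why retention by segment deletion is cheap: dropping the oldest segment just removes the smallest entry from the sorted list.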
11. Kafka Message - continued
• Writes to logs
– Appends go to the latest segment file
– The OS flushes messages to disk based on either the number of messages or time
• Reads from logs
– The consumer provides an offset & a chunk size
– Kafka returns an iterator to iterate over the message set
– On failure, consumers can restart consuming from either the start of the stream or the latest offset
12. Message Retention
• Kafka retains and expires messages via three options
– Time-based (the default, which keeps messages for at least 168 hours)
– Size-based (a configurable amount of data per partition)
– Key-based (one message is retained for each discrete key)
• Time- and size-based retention can work together, but not with key-based retention
– With both configured, messages are retained until either the size limit or the time limit is reached, whichever comes first
• Retention can be overridden per-topic
– Use the kafka-topics.sh CLI to set these configs
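Key-based retention (log compaction) keeps only the latest message per discrete key. A toy Python sketch of the idea (an illustration only, not the broker's actual background compaction of segment files):

```python
def compact(log):
    """Keep only the newest entry for each key, preserving log order.

    `log` is a list of (key, value) pairs, oldest first; a value of
    None models a delete marker (tombstone)."""
    latest = {}
    for key, value in log:
        latest[key] = value          # newer entries overwrite older ones
    # Tombstoned keys are eventually dropped entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user1", "a"), ("user2", "b"), ("user1", "c"), ("user2", None)]
print(compact(log))  # [('user1', 'c')]
```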
16. Kafka consumer
• Consumers are the processes that subscribe to a topic and process its feed
• High-level consumer
– Multi-threaded
– Manages offsets for you
• Simple consumer
– Greater control over consumption
– Must manage offsets itself
– Must find the broker hosting the leader partition
17. Kafka Consumer -- continued
• Important options to provide when consuming
– Zookeeper details
– Topic name
– Where to start consuming (from the beginning or from the tail)
– auto.offset.reset
– group.id
– auto.commit.enable (true)
• Console consumer
– Helps in debugging issues & can be used inside applications
– bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning
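For reference, the options above correspond to properties of the 0.8-era high-level consumer. A sketch of a typical configuration, written here as a Python dict purely for illustration (the keys are the property names from the slide; the values are made up):

```python
# Illustrative values only; the keys are the high-level consumer
# properties mentioned on the slide.
consumer_config = {
    "zookeeper.connect": "localhost:2181",  # Zookeeper details
    "group.id": "my-consumer-group",        # unique string per consumer group
    "auto.offset.reset": "smallest",        # "smallest" = beginning, "largest" = tail
    "auto.commit.enable": "true",           # commit offsets automatically
}
```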
19. Basic Kafka operations -- continued
• DO NOT DELETE TOPICS! Though you have the option to do so
• What happens when a broker dies?
– Leader failover
– Corrupted index / log files
– Under-replicated partitions (URPs)
– Uneven leader distribution
• Preferred replica election
– bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
– or auto.leader.rebalance.enable=true
21. Kafka operations – continued
• Expanding a Kafka cluster
– Create brokers with new, unique broker IDs
– Topics will not automatically move to the new brokers
– An admin needs to initiate the move
• Generate the plan: bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics-to-move.json --broker-list "5,6" --generate
• Execute the plan: bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --execute
• Verify the execution: bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --verify
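The --topics-to-move-json-file argument above expects a small JSON document naming the topics to relocate; --generate then emits the reassignment JSON that --execute and --verify consume. The input format looks roughly like this (the topic name is illustrative):

```json
{
  "version": 1,
  "topics": [
    {"topic": "mytopic"}
  ]
}
```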
22. Data Assurance
• No data loss and no reordering
– Critical for applications like DB replication
– Can Kafka do this? Yes!
• Causes of data loss on the producer side
– Setting block.on.buffer.full=false
– Retries being exhausted
– Sending messages without acks=all
• How can you fix it?
– Set block.on.buffer.full=true
– Set retries to Long.MAX_VALUE
– Set acks to all
– Resend from your callback function (producer.send(record, callback))
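Taken together, the producer-side fixes amount to a handful of configs plus a callback that re-sends on failure. A hedged Python sketch (the real client is the Java producer; the config keys mirror the slide, while fake_send is a stand-in used only to simulate one failed and one successful delivery):

```python
# The no-data-loss producer settings from the slide, as they would be
# passed to the (Java) producer; shown here only for illustration.
producer_config = {
    "acks": "all",                   # wait for all in-sync replicas
    "retries": 2**63 - 1,            # Long.MAX_VALUE: effectively retry forever
    "block.on.buffer.full": "true",  # block instead of dropping messages
}

def make_resend_callback(send):
    """Build a completion callback that re-sends the record on failure,
    mimicking the producer.send(record, callback) pattern."""
    def on_completion(record, exception):
        if exception is not None:
            send(record, on_completion)  # naive resend; bound this in real code
    return on_completion

# Tiny simulation: the first send "fails", the retry succeeds.
delivered, attempts = [], []
def fake_send(record, callback):
    attempts.append(record)
    if len(attempts) == 1:
        callback(record, RuntimeError("broker unavailable"))
    else:
        delivered.append(record)
        callback(record, None)

fake_send("msg-1", make_resend_callback(fake_send))
print(delivered)  # ['msg-1'] after one retry
```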
23. Data Assurance - Continued
• Causes of data loss on the consumer side
– Offsets are carelessly committed
– Data loss can happen if the consumer committed the offset but died while processing the message
• Fixing data loss on the consumer side
– Commit the offset only after processing of the message is complete
– Disable automatic offset commits (auto.commit.enable=false)
• Fixing on the broker side
– Have a replication factor >= 3
– Have min.insync.replicas = 2
– Disable unclean leader election
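The consumer-side rule (process first, commit second) can be sketched as a loop. This is a pure-Python simulation; a real consumer would call its client's commit API instead of the commit() stub here:

```python
class ManualOffsetConsumer:
    """Toy consumer that advances its committed offset only after a
    message has been fully processed (at-least-once delivery)."""

    def __init__(self, messages):
        self.messages = messages   # the partition's log
        self.committed = 0         # next offset to read after a restart

    def poll(self):
        return list(enumerate(self.messages))[self.committed:]

    def commit(self, offset):
        self.committed = offset + 1  # stand-in for the real commit call

processed = []
consumer = ManualOffsetConsumer(["m0", "m1", "m2"])
for offset, msg in consumer.poll():
    processed.append(msg)        # 1. process the message
    consumer.commit(offset)      # 2. only then commit its offset
print(consumer.committed)  # 3 (safe to resume here after a crash)
```

A crash between the two steps re-delivers the in-flight message on restart, which is the at-least-once trade-off the slide describes; committing before processing would instead risk losing it.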
24. Data Assurance - Continued
• Message reordering
– Can occur if more than one message is in transit
– and retries are enabled
• Fixing message reordering
– Set max.in.flight.requests.per.connection=1
25. Kafka 0.9 (Beta release)
• Security
– Kerberos- or TLS-based authentication
– Unix-like permissions to restrict who can access data
– Encryption on the wire via SSL
• Kafka Connect
– Supports large-scale real-time import and export for Kafka
– Takes care of fault tolerance, offset management and delivery management
– Will support connectors for Hadoop and databases
• User-defined quotas
– To manage abusive clients
– Rate-limit traffic on the producer side and the consumer side
26. Kafka 0.9 (Beta release)
– For example, allow only 10 MBps for reads and 5 MBps for writes
– If clients violate the quota, they are slowed down
– Can be overridden
• New Consumer
– Removes the distinction between the high-level consumer and the simple consumer
– Unified consumer API
– No longer Zookeeper-dependent
– Offers pluggable offset management
27. How Can You Get Involved?
• http://kafka.apache.org
• Join the mailing lists
– users@kafka.apache.org
• irc.freenode.net - #apache-kafka
28. Q & A
Want to contact us?
Akash Vacher (avacher@linkedin.com)
Ayyappadas Ravindran (appu@linkedin.com)
Talent Partner : Syed Hussain (sshussain@linkedin.com)
Mob : +91 953 581 8876
Editor's Notes
Kafka is a publish-subscribe messaging system, in which there are four components:
- Broker (what we call the Kafka server)
- Zookeeper (which serves as a data store for information about the cluster and consumers)
- Producer (sends data into the system)
- Consumer (reads data out of the system)
Data is organized into topics (here we show a topic named “A”) and topics are split into partitions (we have partitions 0 and 1 here).
A “message” is a discrete unit of data within Kafka. Producers create messages and send them into the system. The broker stores them, and any number of consumers can then read those messages.
In order to provide scalability, we have multiple brokers. By spreading out the partitions, we can handle more messages in any topic.
This also provides redundancy. We can now replicate partitions on separate brokers. When we do this, one broker is the designated “leader” for each partition. This is the only broker that producers and consumers connect to for that partition. The brokers that hold the replicas are designated “followers” and all they do with the partition is keep it in sync with the leader.
When a broker fails, one of the brokers holding an in-sync replica takes over as the leader for the partition. The producer and consumer clients have logic built-in to automatically rebalance and find the new leader when the cluster changes like this. When the original broker comes back online, it gets its replicas back in sync, and then it functions as the follower. It does not become the leader again until something else happens to the cluster (such as a manual change of leaders, or another broker going offline).
In the previous slides we have seen that Zookeeper is an integral part of the Kafka ecosystem.
So let's see what Zookeeper is: Zookeeper is a distributed coordination service for distributed applications.
Zookeeper is also used for configuration maintenance.
Zookeeper exposes simple APIs, using which applications can build higher-level coordination services.
Zookeeper guarantees ordering, atomicity and reliability.
Zookeeper is implemented over a shared hierarchical namespace, modeled on a shared Linux file system.
Every node in Zookeeper is called a znode.
A znode is similar to a file: it stores data, and it has an ACL and stat information.
Two important concepts in the Zookeeper ecosystem are ephemeral nodes & watches.
An ephemeral node exists as long as the session that created it exists.
Clients can set watches on a znode; the client is informed when the znode changes.
Now let's quickly see the coordination service in action: a leader election.
Consider that you have multiple clients competing to become the leader.
The challenge is how to elect a leader.
Zookeeper can be used for leader election.
Znodes are created with the SEQUENCE and EPHEMERAL flags set.
With the SEQUENCE flag, each znode is created with a monotonically increasing number appended to the end of its path.
The client which manages to create the znode with the lowest sequence number is elected leader.
The znode created is ephemeral, so it exists only as long as the leader exists.
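The recipe above boils down to: every candidate creates a sequential ephemeral znode, and the lowest sequence number wins. A minimal sketch of just the election step (pure Python; a real implementation would use a Zookeeper client library, and the znode names here are illustrative):

```python
def elect_leader(znodes):
    """Given sequential znode names like 'election/n_0000000042',
    return the one with the lowest sequence number; its creator
    becomes the leader."""
    def seq(znode):
        return int(znode.rsplit("_", 1)[1])
    return min(znodes, key=seq)

candidates = ["election/n_0000000003",
              "election/n_0000000001",
              "election/n_0000000002"]
print(elect_leader(candidates))  # election/n_0000000001
```

Because the winning znode is ephemeral, it disappears when the leader's session ends, and the remaining candidates re-run the same comparison to pick a successor.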
Now let's see how Zookeeper is used in the Kafka environment.
Kafka uses Zookeeper both for storing configuration information and for coordination (leader election & executing administrative tasks).
Zookeeper is used to store metadata about brokers, topics and consumers.
When a broker comes up, it registers itself with ZK: it creates a znode & stores the broker ID, hostname and endpoint details in it.
Two types of topic-related information are stored in ZK. The first is broker-related topic information: which broker hosts which topic/partition, replication information, and which replicas are leaders and which are followers.
The second type of topic-related information is configuration: per-topic settings like retention and clean-up policies.
Zookeeper also stores consumer information, like which consumers are consuming from which partitions and up to what point (in the log) a consumer has consumed, i.e. offset information.
Coming to the coordination service: one of the brokers in a Kafka cluster takes on the role of controller. The controller is responsible for managing the state of brokers, partitions and replicas, and also performs administrative tasks.
Controller election is done using zookeeper.
We run Zookeeper 3.4. It is used not just for Kafka but also for other critical applications at LinkedIn.
We have a cluster size of 5 + 1, where 5 are voting members and 1 is a non-voting member called an observer. The primary role of the observer is disaster recovery; it also helps with read scalability.
We make sure the nodes are in different racks to ensure power redundancy, and we use bond0 (balance-rr bonding), which provides load balancing and fault tolerance.
If your system is write-heavy it is good to have better disk performance; we use SSDs for the transaction logs. At a minimum, keep the transaction logs on a separate drive from the application logs and snapshots.
Do not overbuild your cluster: as the cluster size increases, the latency of ZK write transactions increases.
Alright, we have seen Zookeeper; now let's talk about brokers.
Brokers are the nodes which run the Kafka process.
Brokers store commit logs for topics/partitions.
Brokers register themselves with Zookeeper when they start.
Multiple brokers form a Kafka cluster.
Clusters are good because they provide redundancy (replicas) and fault tolerance.
The default replication factor we use is 2, so we can afford one node failure.
Clusters can be horizontally scaled: when you want to expand a cluster, you add more brokers.
You get better network usage and disk IO with multiple machines.
The controller is a broker with additional responsibility.
The controller is the brain of the cluster; it is a state machine.
We keep the state in Zookeeper; when there is a state change, the controller acts on it.
The controller manages the brokers, takes care of partitions and replication, and performs administrative tasks.
As said earlier, in Kafka a message is a discrete unit of data.
Messages are stored in commit logs.
Commit logs are distributed, partitioned and replicated.
A message contains a header and a payload.
The header contains information like the size of the payload, a CRC32 checksum, and the compression used (snappy or gzip).
Leaving the payload as a byte array gives a lot of flexibility.
At LinkedIn our messages are Avro-formatted.
Commit logs are stored in sequence files.
Sequence files are stored under a folder named after the topic-partition.
Sequence files contain log entries.
The header size is 4 bytes.
The payload can be of variable size; at LinkedIn we cap it at 1 MB.
Messages in the commit log are identified by an offset number.
An offset is a 64-bit number; it represents the position of the message from the start of the stream, i.e. the start of that topic partition.
Segments are named after the first offset in the segment.
Writes happen at the tail end of the latest segment.
Messages are written to the OS page cache and flushed to disk based on either the number of messages or a period of time.
When reading from the log, consumers provide an offset number and a chunk size.
Kafka returns an iterator, which contains a message set.
Ideally the chunk size will cover multiple messages.
There can be a corner case in which a message is larger than the chunk size provided; in that case, the consumer doubles its chunk size and retries.
On consumer failure (here meaning the consumer tried to fetch an offset which doesn't exist), the consumer can either fail or reset its offset to the start or the current end of the stream.
The commit logs keep accumulating data; at some point this is going to fill your disk, so you need a retention policy to rotate and purge the logs.
Kafka provides two clean-up policies: you can either rotate the logs or compact them.
Rotation can happen based on time or size.
Log compaction is interesting: here we don't purge an entire segment of the log; instead we remove entries having the same key and retain just the latest one. Compaction can only happen with semantic partitioning.
You can have a per-topic retention policy; Kafka ships a CLI which can be used to set this value.
An application which writes data into a Kafka topic is called a producer.
Producer code needs to be given the details of a broker from which it can fetch metadata. The metadata contains information about the brokers, including the broker ID where the leader partition for the topic resides.
The serialization class is pluggable; you can specify an encoder class. At LinkedIn we use Avro serialization.
The partitioner class specifies how messages should be partitioned, i.e. to which partition a message should be written.
request.required.acks specifies whether the producer needs to wait for an ack from the broker. It has 3 values: 0 = don't wait, 1 = wait for an ack from at least the leader, -1 (or all) = wait for acks from all followers as well.
When a producer sends messages to a broker, you can ask it to batch multiple messages and send them in one go. This way you can compress the messages, and there is less overhead in terms of creating connections to the broker.
Different types of compression are supported, like gzip, snappy and lz4.
Sticky partitioning is specific to LinkedIn; it makes sure we send messages to only one partition for a given period of time. This way we can reduce the connection count.
So we talked about messages.
At LinkedIn we send messages in Avro format.
An Avro message contains the schema of the data plus the data itself; the data is stored in a serialized binary format.
LinkedIn's custom producer adds extra information to each message for tracking and auditing purposes.
To save on storage and network, the schema is stripped from the message and stored in a centralized location; the message carries a schema ID used to retrieve the schema.
When a consumer wants to read the data, it retrieves the schema from the schema registry and then reads the message.
Schemas are cached locally so as to reduce load on the schema registry.
We don't want to break existing consumers, so old backward-compatible schemas are also stored in the schema registry.
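The flow described here (strip the schema, prepend a schema ID, look the schema up on read) can be sketched with a fixed-size binary header. The 4-byte ID and the in-memory registry below are illustrative assumptions, not LinkedIn's actual wire format:

```python
import struct

# Stand-in for the schema registry service; in practice this mapping
# is fetched over the network and cached locally.
schema_registry = {1: '{"type": "string"}'}

def encode(schema_id, payload):
    """Prepend a 4-byte big-endian schema ID to the serialized payload."""
    return struct.pack(">I", schema_id) + payload

def decode(message):
    """Split off the schema ID, look up the schema, return both parts."""
    (schema_id,) = struct.unpack(">I", message[:4])
    schema = schema_registry[schema_id]   # registry lookup (or cache hit)
    return schema, message[4:]

msg = encode(1, b"hello")
schema, payload = decode(msg)
print(schema, payload)
```

Storing old backward-compatible schemas under their own IDs is what lets a consumer decode messages produced before a schema change.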
Consumers are the processes responsible for consuming from topics. They subscribe to a topic, consume the messages and process them.
Kafka offers a consumer abstraction called the 'consumer group'. A consumer group has one or more consumer instances; multiple consumer instances label themselves with a consumer group.
Traditionally consumers work either in queue mode or pub-sub mode. In queue mode each message is sent to one consumer instance; in pub-sub mode, a message is sent to all instances.
Messaging systems guarantee ordering, but when messages are delivered to multiple consumers asynchronously they may not be received in order. The workaround is to use a single consumer, but then there is no parallel consumption. Kafka solves this via partitions: for an N-partition topic you can have N consumers, with each consumer consuming from one partition. This guarantees ordering per partition, and with multiple partitions you get parallelism.
High-level consumers are multithreaded and manage offsets for you. They have consumer groups, and rebalance when a consumer instance joins or leaves a group.
The simple consumer gives you greater control: you can read a subset of partitions, and messages can be read repeatedly. The drawback is that you need to deal with offsets and find the leader partitions yourself.
Consumers keep their offset information in Zookeeper. This marks the point up to which a particular consumer has consumed; if that consumer thread dies, it knows exactly where to resume.
Obviously you need to tell consumers which topic to consume from.
You have the option to start consuming at a particular offset, or from the start or end of the stream.
auto.offset.reset: remember that consumers store offsets in Zookeeper. If the consumer was not able to reach Zookeeper, or provided an offset which doesn't exist, what should the default behavior be: consume from the tail end or from the beginning? This is controlled by auto.offset.reset.
The consumer group is an abstraction that helps consumer instances consume messages either in queueing fashion or in pub-sub mode. group.id is used to set the consumer group name; it takes a string.
auto.commit.enable: consumers store the offset up to which they have consumed. Enabling this makes the consumer commit offsets automatically.
group.id is a string that represents the consumer group; it should be unique.
Kafka ships command-line tools to manage Kafka clusters. These tools are used for maintenance and debugging; we will go through them quickly.
Now let's see the operational challenges when a broker dies.
Kafka does the leader failover, so one of the followers in the ISR becomes the new leader.
You will end up having corrupt index/log files.
You will end up with under-replicated partitions; obviously, since you lost one of the replicas.
Kafka takes care of the corrupt index/log files: it discards the incomplete log entry.
URPs are fixed when the broker comes back up. The point to remember is that the replicas come back as followers.
This creates a challenge: now you have an uneven leader distribution across your cluster.
Kafka ships a CLI with which you can rebalance the leader distribution.
There is also an option to do this automatically, but it's not very clean.
- Partition reassignment
- A broker-leveling script moves data to even out the data volume per broker
As I mentioned, when you add a broker to a cluster it won't be used by existing partitions.
With the 0.8.1 release of Kafka there is a new feature: partition reassignment! Now when you add a broker to the cluster, it can be used by your existing topics and partitions. Existing partitions can be moved around live, completely transparently to all consumers and producers. We have developed a tool that sits on top of the partition reassignment tool and balances a cluster after you add new brokers, or if your cluster is simply unbalanced (there are many ways you can wind up in this state). It goes out to each broker and figures out how big each partition is (on disk) and the total amount of storage used on each broker. Next it starts calling the partition reassignment tool to make the larger brokers smaller and the smaller brokers larger. It stops once the overall data size is within 1 GB between the smallest and largest brokers. This is just one example of the many ways to optimize a cluster with the partition reassignment tool.
Expanding a Kafka cluster is very simple: create brokers with unique broker IDs and start the Kafka server; the server will be automatically added to the cluster.
But adding new brokers won't trigger automatic balancing; an admin needs to move topics to the new brokers. Only the initiation is manual; the process itself is automated.
Data loss is not a big deal in applications like pageview event tracking; losing one or two messages in a million is fine.
It becomes critical for applications like DB replication and transactions which involve money.
Where can loss happen? On the producer end, the consumer side and the broker side.
Let's see the issues on each side.
Causes of data loss on the producer end:
Setting block.on.buffer.full to false will throw an error and discard the messages.
Another cause can be the number of retries being exhausted.
It also matters whether you are running in async or sync mode, and in sync mode, whether you are waiting for a commit message from all replicas or not.
When block.on.buffer.full is set to true, the producer won't take any more messages when its buffer is full.
If you set retries to Long.MAX_VALUE, it will retry up to 2^63 - 1 times.
Set acks to all.
The cause of data loss on the consumer end is that you are careless! Just kidding.
This can happen if you consume messages and commit the offset before actually processing them; you can then fail during processing.
How do you fix this?
One: commit the offset only after processing the message.
Two: disable automatic offset commits.
Data needs to be moved in and out of Kafka and other systems.
People use multiple solutions.
So how can you get more involved in the Kafka community?
The most obvious answer is to go to kafka.apache.org. From there you can:
Join the mailing lists, either on the development or the user side.
You can also dive into the source repository, and work on and contribute your own tools back.
Kafka may be young, but it’s a critical piece of data infrastructure for many of us.