Apache Kafka - Onkar Kadam



2. WHAT IS KAFKA?
• Kafka is a distributed messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn's activity stream and operational data processing pipeline. It is now used at a variety of different companies for various data pipeline and messaging uses.
• LinkedIn developed Kafka for collecting and delivering high volumes of log data with low latency, for real-time log processing.
• Written in Scala; uses Zookeeper for coordination.
3. WHAT IS KAFKA? CONTD..
• Kafka can be used for both ONLINE (real-time) and OFFLINE (integration with Hadoop) analysis of log data.
• Distributed and scalable.
• High throughput.
• More priority on efficiency than on fancy features.
• Simple API.
• Low overhead.
5. WHERE TO USE APACHE KAFKA?
• User activity events corresponding to logins, page views, clicks, "likes", sharing, comments, and search queries.
• Operational metrics (health monitoring).
• Search relevance.
• Recommendations.
• Ad targeting and reporting.
• Protection against abnormal behavior.
• Newsfeed.
• Etc.
6. FLAWS IN TRADITIONAL MESSAGING SYSTEMS
• More focus on delivery guarantees.
• Less focus on throughput.
• Weak distributed support.
• Performance degrades as the queue of unconsumed messages grows.
9. ARCHITECTURE CONTD..
• A stream of messages of a particular type is defined by a topic.
• A topic is divided into multiple partitions.
• Each partition is a directory holding a list of files; the files store the message data.
• A producer can publish messages to a topic (push).
• Published messages are stored on a set of servers called brokers.
• Each broker stores one or more partitions.
• Consumers can subscribe to any of the topics and consume messages by pulling data from the brokers.
• Supports both the point-to-point delivery model and the publish/subscribe delivery model.
• Uses a "pull-based" system model.
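The relationships above (topics split into partitions, producers pushing, consumers pulling by offset) can be sketched as a toy in-memory model. All names here are illustrative, not Kafka's actual API:

```python
# Toy in-memory model of Kafka's topic/partition layout (illustrative only).
# Each partition is an append-only list of messages stored on the broker.

class Broker:
    def __init__(self):
        self.partitions = {}  # (topic, partition_id) -> list of messages

    def publish(self, topic, partition_id, message):
        # Producer pushes: the broker appends to the chosen partition.
        log = self.partitions.setdefault((topic, partition_id), [])
        log.append(message)
        return len(log) - 1  # position of the message within the partition

    def fetch(self, topic, partition_id, offset, max_messages):
        # Consumer pulls: reads from a given offset, at its own pace.
        log = self.partitions.get((topic, partition_id), [])
        return log[offset:offset + max_messages]

broker = Broker()
for i in range(5):
    broker.publish("page-views", 0, f"view-{i}")

print(broker.fetch("page-views", 0, 2, 2))  # -> ['view-2', 'view-3']
```

Note how both delivery models fall out of this layout: point-to-point when consumers split the partitions among themselves, publish/subscribe when each consumer reads every partition.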
10. WHY THE "PULL MODEL"?
• A "push-based" system has difficulty dealing with diverse consumers, because the broker controls the rate at which data is transferred; pushing faster than a consumer can keep up effectively denial-of-services it and loses data.
• In a pull-based system, a consumer simply falls behind and catches up when it can.
• A consumer can rewind to consume old messages.
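The pull model means the consumer, not the broker, owns the read position. A minimal sketch with hypothetical names (the "broker" is just a list):

```python
# Sketch of a pull-based consumer that owns its own position.
# Nothing is ever pushed at it; a slow consumer simply falls behind.

log = [f"msg-{i}" for i in range(10)]  # stand-in for a partition on the broker

class PullConsumer:
    def __init__(self):
        self.offset = 0  # consumer-side state, not broker-side

    def poll(self, max_messages=3):
        # Read at most max_messages starting at the local offset.
        batch = log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        # Re-consume old messages by resetting the local offset.
        self.offset = offset

c = PullConsumer()
c.poll()          # msg-0..msg-2
c.poll()          # msg-3..msg-5
c.rewind(1)
print(c.poll(2))  # -> ['msg-1', 'msg-2']
```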
11. DEPLOYMENT
12. STORAGE
• Simple storage.
• Each partition corresponds to a logical log (a log == a list of files).
• Physically, a log is implemented using segmentation, where each segment file is of approximately equal size.
• When a message is published, the broker simply appends it to the last segment file.
• Messages are flushed to disk in batches for better performance.
• Messages are only exposed to consumers after they are flushed.
• Messages are addressed by their log offset. #lessoverhead
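The segmented layout can be sketched as follows; the directory structure, segment size, and naming scheme here are assumptions for illustration, not Kafka's actual on-disk format:

```python
# Minimal sketch of a segmented log: a partition directory holds roughly
# fixed-size segment files, and appends always go to the last segment.
import os
import tempfile

SEGMENT_BYTES = 64  # tiny segment size so the demo rolls over quickly

def append(partition_dir, payload: bytes):
    segments = sorted(os.listdir(partition_dir))
    last = os.path.join(partition_dir, segments[-1]) if segments else None
    if last is None or os.path.getsize(last) >= SEGMENT_BYTES:
        # Roll a new segment, named by the byte offset it starts at
        # (zero-padded so lexicographic order == offset order).
        start = sum(os.path.getsize(os.path.join(partition_dir, s))
                    for s in segments)
        segments.append(f"{start:020d}.log")
    with open(os.path.join(partition_dir, segments[-1]), "ab") as f:
        f.write(payload)  # append-only: sequential disk access

part = tempfile.mkdtemp()
for i in range(20):
    append(part, b"0123456789")  # 20 messages of 10 bytes each

print(sorted(os.listdir(part)))  # a handful of segment files named by offset
```

Appending only to the tail of the last segment is what makes writes sequential, which the "Efficient Transfer" slide below relies on.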
13. STORAGE CONTD..
• Id of next message = id of current message + length of current message.
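In other words, message ids are byte offsets into the log, so the broker needs no separate index. A worked example of the rule above:

```python
# Message ids as byte offsets: the id of the next message is the id of the
# current message plus the current message's length in bytes.
messages = [b"hello", b"kafka", b"log!"]

offsets = [0]
for m in messages:
    offsets.append(offsets[-1] + len(m))  # id of next = id of current + length

print(offsets[:-1])  # -> [0, 5, 10]  (the id/offset of each message)
```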
14. EFFICIENT TRANSFER
• A producer can submit multiple messages in a single send request. #endtoendmessagebatching
• Each pull request can consume multiple messages, up to a certain size.
• No caching of messages in the Kafka process layer; messages are only cached in the OS page cache ("no double buffering"), hence very little overhead from garbage collecting its memory. #filesystemcaching #lessoverhead
• Producers and consumers access the data files sequentially (disks are fast when accessed sequentially).
15. EFFICIENT TRANSFER: FROM LOCAL FILE TO REMOTE SOCKET
1. Read data from the storage media into the page cache of the OS.
2. Copy the data from the page cache into an application buffer.
3. Copy the application buffer into another kernel buffer.
4. Send that kernel buffer to the socket.
• This process usually takes 4 data copies and 2 system calls. The sendfile API avoids 2 of the copies and 1 system call. #zerocopytransfer (broker → consumer)
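Python exposes this system call as `os.sendfile`, so the shortcut can be demonstrated directly. A sketch, assuming a Linux-like OS (a `socketpair` stands in for the broker-to-consumer TCP connection):

```python
# Zero-copy transfer via the sendfile(2) system call: the file's bytes move
# from the page cache to the socket without passing through a user-space
# buffer, replacing the read-then-write pair of calls.
import os
import socket
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"log-segment-bytes" * 4)  # 68 bytes of pretend log data
    path = f.name

broker_side, consumer_side = socket.socketpair()  # stand-in for a TCP link
with open(path, "rb") as segment:
    # One system call, no application buffer: kernel -> socket directly.
    sent = os.sendfile(broker_side.fileno(), segment.fileno(), 0, 68)

received = consumer_side.recv(128)
print(sent)  # -> 68
```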
16. STATELESS BROKER
• The consumer maintains its own state, reducing complexity and overhead on the broker.
• The disadvantage is that the broker doesn't know whether all subscribers have consumed a message. This is solved by using a simple time-based SLA for the retention policy: a message is automatically deleted if it has been retained longer than a certain period (typically 7 days).
• A consumer can deliberately rewind to an old offset and re-consume data. #violateslogicofqueue
17. DISTRIBUTED MODEL
[Diagram: Producer 1 and Producer 2 → Broker 1, Broker 2, Broker 3 → Consumer 1 and Consumer 2, coordinated by Zookeeper]
18. DISTRIBUTED MODEL CONTD..
• A consumer group consists of one or more consumers.
• All messages from a partition are consumed by a single consumer within a consumer group. #lessoverheadonbroker
• No master node → less complexity and no master failures to worry about.
• Zookeeper coordinates the producers, consumers, and brokers.
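The "one consumer per partition within a group" rule can be sketched as a simple assignment; this round-robin helper is hypothetical and is not Kafka's actual rebalance algorithm:

```python
# Sketch of partition ownership within a consumer group: partitions are
# divided among the group's consumers so each partition has exactly one
# owner, and the broker never tracks per-consumer state for a partition.

def assign(partitions, consumers):
    owners = {}
    for i, p in enumerate(partitions):
        owners[p] = consumers[i % len(consumers)]  # round-robin ownership
    return owners

partitions = ["topicA-0", "topicA-1", "topicA-2", "topicA-3"]
print(assign(partitions, ["c1", "c2"]))
# -> {'topicA-0': 'c1', 'topicA-1': 'c2', 'topicA-2': 'c1', 'topicA-3': 'c2'}
```

Because ownership is exclusive, consumers within a group never need to coordinate per-message; a rebalance (triggered via Zookeeper, next slide) only recomputes this map when members come and go.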
19. ZOOKEEPER FUNCTIONS
• Detection of the addition/removal of brokers, producers, and consumers.
• Triggering a rebalance process in response to such a detection.
• Tracking, via the following registries:

  Registry            Contents                                          Node type
  Broker registry     hostname, port, and set of topics of the broker   ephemeral
  Consumer registry   consumer group membership                         ephemeral
  Ownership registry  the consumer currently consuming a partition      ephemeral
  Offset registry     offset of the last consumed message for each      persistent
                      subscribed partition
20. DELIVERY GUARANTEE
• At-least-once delivery. #cancauseduplication #costeffective
• Messages from a single partition arrive in order.
• Messages from multiple partitions are not necessarily in order. #noguarantee
• A CRC is stored for each message in the logs. #avoidlogcorruption
• On an I/O error on the broker, Kafka removes messages with inconsistent CRCs.
• If the storage system is completely damaged and consumers have not yet consumed a message, that message is lost forever. #futureplanstoaddreplication
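The CRC check can be sketched in a few lines, using `zlib.crc32` as a stand-in for whatever checksum the broker actually stores with each message:

```python
# Sketch of per-message CRC validation: on recovery after an I/O error,
# messages whose stored CRC no longer matches their bytes are dropped.
import zlib

log = [(zlib.crc32(m), m) for m in [b"m1", b"m2", b"m3"]]
log[1] = (log[1][0], b"mX")  # simulate on-disk corruption of one message

valid = [m for crc, m in log if zlib.crc32(m) == crc]
print(valid)  # -> [b'm1', b'm3']
```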
21. RECENT DEVELOPMENTS
• End-to-end batch-level compression.
• Improved stream processing libraries.
• Hadoop consumer.
• Hadoop producer.
22. APACHE KAFKA @
23. REFERENCES
• http://incubator.apache.org/kafka/
• http://research.microsoft.com/en-us/um/people/srikan
• http://vimeo.com/27592622