
Kafka at Peak Performance

Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka forms the core of the data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from ZooKeeper, to the brokers, to the various client applications. That means we need to know how well the system is running before we can start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to ensure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the brokers and clients. We'll also talk about setting up Kafka for no data loss.

  • Nice presentation, Todd Palino. I like your advice about sticking with the defaults. However, for produce requests we need to provide SLA guarantees with acks=all. In that case, if the request timeout is the default 30 seconds, the caller will time out and try to republish. Is there a way we can tune this better to suit our use case?

Kafka at Peak Performance

  1. Kafka at Peak Performance
  2. Todd Palino
  3. Who Am I?
  4. Kafka At LinkedIn
     • 1100+ Kafka brokers
     • Over 32,000 topics
     • 350,000+ partitions
     • 875 Billion messages per day
     • 185 Terabytes In
     • 675 Terabytes Out
     • Peak Load (whole site)
       – 10.5 Million messages/sec
       – 18.5 Gigabits/sec Inbound
       – 70.5 Gigabits/sec Outbound
     • 1800+ Kafka brokers
     • Over 79,000 topics
     • 1,130,000+ partitions
     • 1.3 Trillion messages per day
     • 330 Terabytes In
     • 1.2 Petabytes Out
     • Peak Load (single cluster)
       – 2 Million messages/sec
       – 4.7 Gigabits/sec Inbound
       – 15 Gigabits/sec Outbound
  5. What Will We Talk About?
     • Picking Your Hardware
     • Monitoring the Cluster
     • Triaging Broker Performance Problems
     • Conclusion
  6. Hardware Selection
  7. What's Important To You?
     • Message Retention - Disk size
     • Message Throughput - Network capacity
     • Producer Performance - Disk I/O
     • Consumer Performance - Memory
  8. Go Wide
     • Kafka is well-suited to horizontal scaling
     • RAIS - Redundant Array of Inexpensive Servers
     • Also helps with CPU utilization
       – Kafka needs to decompress and recompress every message batch
       – KIP-31 will help with this by eliminating recompression
     • Don't co-locate Kafka
  9. Disk Layout
     • RAID
       – Can survive a single disk failure (not RAID 0)
       – Provides the broker with a single log directory
       – Eats up disk I/O
     • JBOD
       – Gives Kafka all the disk I/O available
       – Broker is not smart about balancing partitions
       – If one disk fails, the entire broker stops
     • Amazon EBS performance works!
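     With JBOD, each mount point is listed in the broker's log.dirs setting; with RAID there is a single log directory. A minimal server.properties sketch (the mount points here are placeholders):

        # JBOD: one data directory per physical disk. The broker only balances
        # new partitions by partition count per directory, not by size or I/O load.
        log.dirs=/data/kafka-1,/data/kafka-2,/data/kafka-3,/data/kafka-4

        # RAID: a single directory backed by the array.
        # log.dirs=/data/kafka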
  10. Operating System Tuning
      • Filesystem Options
        – EXT or XFS
        – Using unsafe mount options
      • Virtual Memory
        – Swappiness
        – Dirty Pages
      • Networking
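      The sysctl values LinkedIn uses are listed in the appendix. As a hedged example of the filesystem side, a common tweak is mounting the log volume with noatime; "unsafe" options such as nobarrier go further and trade crash safety for throughput. The device and mount point below are placeholders:

        # /etc/fstab entry for the Kafka data volume
        # noatime: skip the metadata write on every file read
        # add nobarrier only if you accept the risk of data loss on power failure
        /dev/sdb1  /data/kafka  xfs  defaults,noatime  0  0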
  11. Java
      • Only use JDK 8 now
      • Keep heap size small
        – Even our largest brokers use a 6 GB heap
        – Save the rest for page cache
      • Garbage Collection - G1 all the way
        – Basic tuning only
        – Watch for humongous allocations
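      The full flag set is in the appendix (slide 37). If you use the stock start scripts, these flags are usually passed through environment variables; a sketch, with the heap and G1 values taken from the appendix:

        # Picked up by kafka-server-start.sh via kafka-run-class.sh
        export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
        export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M"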
  12. How Much Do You Need?
  13. Buy The Book! Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
  14. Kafka Cluster Sizing
      • How big for your local cluster?
        – How much disk space do you have?
        – How much network bandwidth do you have?
        – CPU, memory, disk I/O
      • How big for your aggregate cluster?
        – In general, multiply the number of brokers by the number of local clusters
        – May have additional concerns with lots of consumers
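      A rough way to turn these questions into numbers, as a back-of-the-envelope sketch; the traffic figure, retention, replication factor, and 40% headroom below are all made-up assumptions:

        #!/bin/bash
        # Disk needed = inbound rate x retention x replication factor x headroom
        INBOUND_MB_PER_SEC=100      # average producer traffic into the cluster
        RETENTION_DAYS=4
        REPLICATION_FACTOR=3
        HEADROOM=1.4                # keep ~40% free for growth and rebalancing
        DISK_TB=$(echo "$INBOUND_MB_PER_SEC * 86400 * $RETENTION_DAYS * $REPLICATION_FACTOR * $HEADROOM / 1000000" | bc -l)
        printf 'Approximate cluster disk needed: %.1f TB\n' "$DISK_TB"
        # Prints roughly 145.2 TB for the example numbers above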
  15. Topic Configuration
      • Partition Counts for Local
        – Many theories on how to do this correctly, but the answer is "it depends"
        – How many consumers do you have?
        – Do you have specific partition requirements?
        – Keeping partition sizes manageable
      • Partition Counts for Aggregate
        – Multiply the number of partitions in a local cluster by the number of local clusters
        – Periodically review partition counts in all clusters
      • Message Retention
        – If aggregate is where you really need the messages, only retain them in local long enough to cover mirror maker problems
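      A hedged example of the retention point: keep the topic in the local cluster only long enough to ride out a mirror maker outage, and let the aggregate cluster hold the long retention. The topic name, ZooKeeper string, and counts below are placeholders:

        # Partition count chosen as a multiple of the broker count; 4-hour local retention
        kafka-topics.sh --zookeeper zk1:2181/kafka --create --topic page-views \
          --partitions 24 --replication-factor 3 --config retention.ms=14400000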
  16. Possible Broker Improvements
      • Namespaces
        – Namespace topics by datacenter
        – Eliminate local clusters and just have aggregate
        – Significant hardware savings
      • JBOD Fixes
        – Intelligent partition assignment
        – Admin tools to move partitions between mount points
        – Broker should not fail completely with a single disk failure
  17. Administrative Improvements
      • Multiple cluster management
        – Topic management across clusters
        – Visualization of mirror maker paths
      • Better client monitoring
        – Burrow for consumer monitoring
        – No open source solution for producer monitoring (audit)
      • End-to-end availability monitoring
  18. Keeping An Eye On Things
  19. Monitoring The Foundation
      • CPU Load
      • Network inbound and outbound
      • Filehandle usage for Kafka
      • Disk
        – Free space - where you write logs, and where Kafka stores messages
        – Free inodes
        – I/O performance - at least average wait and percent utilization
      • Garbage Collection
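      Most of these come from whatever host agent you already run; they are also easy to spot-check by hand. The paths below are placeholders, and the filehandle check assumes a single broker process on the host:

        df -h /data/kafka      # free space on the volume holding Kafka messages
        df -i /data/kafka      # free inodes
        iostat -x 5            # per-device await and %util, sampled every 5 seconds
        ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -l   # open filehandles for the broker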
  20. Broker Ground Rules
      • Tuning
        – Stick (mostly) with the defaults
        – Set default cluster retention as appropriate
        – Default partition count should be at least the number of brokers
      • Monitoring
        – Watch the right things
        – Don't try to alert on everything
      • Triage and Resolution
        – Solve problems, don't mask them
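      A sketch of the two cluster-wide defaults mentioned above, in server.properties; the values are examples for a hypothetical six-broker cluster, not recommendations:

        # Cluster-wide defaults; individual topics can override these per topic
        log.retention.hours=72
        num.partitions=6              # at least the number of brokers
        default.replication.factor=3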
  21. Too Much Information!
      • Monitoring teams hate Kafka
        – Per-Topic metrics
        – Per-Partition metrics
        – Per-Client metrics
      • Capture as much as you can
        – Many metrics are useful while triaging an issue
      • Clients want metrics on their own topics
      • Only alert on what is needed to signal a problem
  22. Broker Monitoring
      • Bytes In and Out, Messages In
        – Why not messages out?
      • Partitions
        – Count and Leader Count
        – Under Replicated and Offline
      • Threads
        – Network pool, Request pool
        – Max Dirty Percent
      • Requests
        – Rates and times - total, queue, local, and send
  23. Topic Monitoring
      • Bytes In, Bytes Out
      • Messages In, Produce Rate, Produce Failure Rate
      • Fetch Rate, Fetch Failure Rate
      • Partition Bytes
      • Log End Offset
        – Why bother?
        – KIP-32 will make this unnecessary
      • Quota Throttling
      • Provide this to your customers for them to alert on
  24. Client Monitoring
      • For consumers, use Burrow
        – Monitor all partitions for all consumers
        – Provides an easy-to-digest "good, warning, bad" state, with detail available
        – Fast and free
      • Producers are a little harder
        – Several internal implementations of message auditing
        – The community needs a good open source standard
      • Cluster availability monitoring
        – kafka-monitoring is coming soon from LinkedIn!
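      Burrow exposes its consumer evaluations over HTTP, so alerting can key off the status string rather than raw lag numbers. The endpoint shape below matches Burrow's v2 API as documented around the time of this talk; the host, cluster name, and group name are placeholders, and you should check the project wiki for the version you deploy:

        # Overall status (OK / WARN / ERR) for one consumer group
        curl -s http://burrow.example.com:8000/v2/kafka/local/consumer/my-group/status
        # Full per-partition lag detail
        curl -s http://burrow.example.com:8000/v2/kafka/local/consumer/my-group/lag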
  25. It's Broken! Now What?
  26. All The Best Ops People…
      • Know more of what is happening than their customers
      • Are proactive
      • Fix bugs, not work around them
      • This applies to our developers too!
  27. Anticipating Trouble
      • Trend cluster utilization and growth over time
      • Use default configurations for quotas and retention to require customers to talk to you
      • Monitor request times
        – If you can develop a consistent baseline, this is an early warning signal
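      On the quota point, brokers of this era (0.9/0.10) support cluster-wide per-client defaults; a sketch with illustrative values, set low enough that heavy users have to come talk to you for an override:

        # server.properties - default per-client quotas in bytes/sec
        quota.producer.default=5242880      # 5 MB/sec
        quota.consumer.default=10485760     # 10 MB/sec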
  28. Under Replicated Partitions
      • Count of partitions that are not fully replicated within the cluster
      • Also referred to as "replica lag"
      • Primary indicator of problems within the cluster
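      The cluster-wide count comes from the UnderReplicatedPartitions metric listed in the appendix; to see which partitions are affected, the topic tool has a dedicated flag (the ZooKeeper string is a placeholder):

        kafka-topics.sh --zookeeper zk1:2181/kafka --describe --under-replicated-partitions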
  29. Broker Performance Checks
      • Are you still running 0.8?
      • Are all the brokers in the cluster working?
      • Are the network interfaces saturated?
        – Reelect partition leaders
        – Rebalance partitions in the cluster
        – Spread out traffic more (increase partitions or brokers)
      • Is the CPU utilization high? (especially iowait)
        – Is another process competing for resources?
        – Look for a bad disk
      • Do you have really big messages?
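      The two rebalancing actions above map to stock tools; the ZooKeeper string and plan file below are placeholders, and a reassignment plan can also be generated with the kafka-assigner tool from linkedin/kafka-tools:

        # Move leadership back to the preferred replicas after a broker bounce
        kafka-preferred-replica-election.sh --zookeeper zk1:2181/kafka

        # Move partitions between brokers according to a JSON plan
        kafka-reassign-partitions.sh --zookeeper zk1:2181/kafka \
          --reassignment-json-file plan.json --execute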
  30. Kafka's OK, Now What?
      • If Kafka is working properly, it's probably a client issue
        – Don't throw it over the fence. Help your customers understand
      • Common producer issues
        – Batch size and linger time
        – Receive and send buffers
        – Sync vs. async, and acknowledgements
      • Common consumer issues
        – Garbage collection problems
        – Min fetch bytes and max wait time
        – Not enough partitions
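      The client settings above correspond to these Java client configs; the values are illustrative starting points assuming a latency-tolerant, high-throughput pipeline, not recommendations:

        # Producer
        batch.size=65536            # bytes per partition batch
        linger.ms=10                # wait briefly so batches can fill
        acks=all                    # full ISR acknowledgement for no-data-loss setups
        send.buffer.bytes=131072
        receive.buffer.bytes=65536

        # Consumer
        fetch.min.bytes=1024        # trade a little latency for fewer, larger fetches
        fetch.max.wait.ms=500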
  31. Conclusion
  32. One Ecosystem
      • Kafka can scale to millions of messages per second, and more
        – Operations must scale the cluster appropriately
        – Developers must use the right tuning and go parallel
      • Few problems are owned by only one side
        – Expanding partitions often requires coordination
        – Applications that need higher reliability drive cluster configurations
      • Either we work together, or we fail separately
  33. Would You Like To Know More?
      • Presentations: http://www.slideshare.net/toddpalino
        – More Datacenters, More Problems
        – Kafka As A Service
        – Always download the originals for slide notes!
      • Blog Posts: https://engineering.linkedin.com/blog
        – Development and SRE blogs on Kafka and other topics
      • LinkedIn Open Source: https://github.com/linkedin/streaming
        – Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
        – Kafka Admin Tools - https://github.com/linkedin/kafka-tools
  34. Getting Involved With Kafka
      • http://kafka.apache.org
      • Join the mailing lists
        – users@kafka.apache.org
        – dev@kafka.apache.org
      • irc.freenode.net - #apache-kafka
      • Meetups
        – Apache Kafka - http://www.meetup.com/http-kafka-apache-org
        – Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
      • Contribute code
  35. Data @ LinkedIn is Hiring!
      • Streams Infrastructure
        – Kafka pub/sub ecosystem
        – Stream Processing Platform built on Apache Samza
        – Next Generation change capture technology (incubating)
      • LinkedIn
        – Strong commitment to open source
        – Do cool things and work with awesome people
      • Join us in working on cutting edge stream processing infrastructures
        – Please contact kparamasivam@linkedin.com
        – Software developers and Site Reliability Engineers at all levels
  36. Appendix
  37. JDK Options
      • Heap Size: -Xmx6g -Xms6g
      • Metaspace: -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
      • G1 Tuning: -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
      • GC Logging: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc
      • Error Handling: -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log
  38. OS Tuning Parameters
      • Networking:
        net.core.rmem_default = 124928
        net.core.rmem_max = 2048000
        net.core.wmem_default = 124928
        net.core.wmem_max = 2048000
        net.ipv4.tcp_rmem = 4096 87380 4194304
        net.ipv4.tcp_wmem = 4096 16384 4194304
        net.ipv4.tcp_max_tw_buckets = 262144
        net.ipv4.tcp_max_syn_backlog = 1024
  39. OS Tuning Parameters (cont.)
      • Virtual Memory:
        vm.oom_kill_allocating_task = 1
        vm.max_map_count = 200000
        vm.swappiness = 1
        vm.dirty_writeback_centisecs = 500
        vm.dirty_expire_centisecs = 500
        vm.dirty_ratio = 60
        vm.dirty_background_ratio = 5
  40. Kafka Broker Sensors
      kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
      kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
      kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
      kafka.server:name=PartitionCount,type=ReplicaManager
      kafka.server:name=LeaderCount,type=ReplicaManager
      kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
      kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
      kafka.controller:name=ActiveControllerCount,type=KafkaController
      kafka.controller:name=OfflinePartitionsCount,type=KafkaController
      kafka.log:name=max-dirty-percent,type=LogCleanerManager
      kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
      kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
      kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
      kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
      kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
      kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
      kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
      kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
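      A hedged example of reading one of these MBeans with the JmxTool class that ships with Kafka; the JMX port is whatever JMX_PORT was set to when the broker started, and 9999 here is a placeholder:

        kafka-run-class.sh kafka.tools.JmxTool \
          --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
          --object-name 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions' \
          --reporting-interval 10000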
  41. Kafka Broker Sensors - Topics
      kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*
