Multi-Tier, Multi-Tenant, Multi-Problem Kafka


At LinkedIn, the Kafka infrastructure is run as a service: the Streaming team develops and deploys Kafka, but is not the producer or consumer of the data that flows through it. With multiple datacenters and numerous applications sharing these clusters, we have developed an architecture with multiple pipelines and multiple tiers. Most days this works well, but it has led to many interesting problems. Over the years we have built a number of solutions, most of them open source, that make it possible to reliably handle over a trillion messages a day.


  1. Multi-Tier, Multi-Tenant, Multi-Problem Kafka
  2. Todd Palino
  3. Who Am I?
  4. What Will We Talk About?
     • Multi-Tenant Pipelines
     • Multi-Tier Architecture
     • Why I Drink (Interesting Problems)
     • Conclusion
  5. Multi-Tenant Pipelines
  6. Tracking and Data Deployment
     • Tracking – data going to HDFS
     • Data Deployment – Hadoop job results going to online applications
     • Many shared topics
     • Schemas require a common header (see the sketch below)
     • All message counts are audited
     • Special problems:
       – Hard to tell which application is dropping messages
       – Some of these messages are copied 42 times!
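The deck does not show the header itself, so the following is only a sketch of what a shared Avro header could look like; the field names are hypothetical, not LinkedIn's actual schema. The point is that every topic embeds the same header record, which is what makes uniform auditing possible:

```java
import org.apache.avro.Schema;

public class AuditHeader {
    // Hypothetical common header record; LinkedIn's real field set differs.
    static final Schema HEADER_SCHEMA = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"KafkaAuditHeader\", \"fields\": ["
      + "{\"name\": \"time\", \"type\": \"long\"},"        // event timestamp (ms)
      + "{\"name\": \"server\", \"type\": \"string\"},"    // producing host
      + "{\"name\": \"appName\", \"type\": \"string\"},"   // producing application
      + "{\"name\": \"messageId\", \"type\": \"string\"}"  // unique id for auditing
      + "]}");
}
```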
  7. Metrics
     • Application and OS metrics
     • Deployment and build system events
     • Service calls – sampled timing information for individual application calls
     • Some application logs
     • Special problems:
       – Every server in the datacenter produces to this cluster at least twice
       – The graphing/alerting system consumes the metrics 20 times
  8. Logging
     • Application log messages destined for ELK clusters
     • Lower retention than other clusters
     • Loosest restrictions on message schema and encoding
     • Special problems:
       – Not many – it's still overprovisioned
       – Customers are starting to ask about aggregation
  9. Queuing
     • Everything else
     • Primarily messages internal to applications
     • Also emails and user messaging
     • Messages are Avro encoded, but do not require headers
     • Special problems:
       – Many messages use unregistered schemas
       – Clusters can have very high message rates (but not much data)
  10. Special Case Clusters
     • Not all use cases fit multi-tenancy:
       – Custom configurations are needed
       – Tighter performance guarantees
       – Use of topic deletion
     • Espresso (KV store) internal replication
     • Brooklin – change capture
     • Replication from Hadoop to Voldemort
  11. Tiered Cluster Architecture
  12. One Kafka Cluster
  13. Multiple Clusters – Message Aggregation
  14. Why Not Direct?
     • Network concerns:
       – Bandwidth
       – Network partitioning
       – Latency
     • Security concerns:
       – Firewalls and ACLs
       – Encrypting data in transit
     • Resource concerns:
       – A misbehaving application can swamp production resources
  15. What Do We Lose?
     • You may lose message ordering
       – Mirror maker breaks apart message batches and redistributes them
     • You may lose key-to-partition affinity (see the sketch below)
       – Mirror maker partitions based on the key
       – Differing partition counts in source and target result in different distributions
       – Mirror maker does not (without extra work) honor custom partitioning
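To see why differing partition counts break key affinity: Kafka's default partitioner maps a key to murmur2(key) modulo the partition count, so the same key can land on different partition numbers in two clusters. A minimal sketch using Kafka's own hash utilities (the key and counts are made up):

```java
import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class PartitionAffinity {
    // Mirrors the default partitioner's logic: murmur2 hash of the key bytes,
    // masked to a positive int, modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "member-12345";
        // Same key, different partition counts: the target partition changes,
        // which is why mirroring between clusters can break key affinity.
        System.out.println(partitionFor(key, 8));   // source cluster: 8 partitions
        System.out.println(partitionFor(key, 12));  // target cluster: 12 partitions
    }
}
```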
  16. Aggregation Rules
     • Aggregate clusters are only for consuming messages
       – Producing to an aggregate cluster is not allowed
       – This assures all aggregate clusters have the same content
     • Not every topic appears in PROD aggregate-tracking clusters
       – Trying to discourage aggregate cluster usage in PROD
       – All topics are available in CORP
     • Aggregate-queuing is whitelist-only and very restricted
       – Please discuss your use case with us before developing
  17. Interesting Problems
  18. Buy the Book!
     Early access is available now. The book covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting, and also discusses stream processing and other use cases.
  19. Monitoring Using Kafka
     • Monitoring and alerting are self-service
       – No gatekeeper on what metrics are collected and stored
     • Applications use a common container:
       – EventBus Kafka producer
       – Simple annotation of metrics to collect
       – Sampled service calls
       – Application logs
     • Everything is produced to Kafka and consumed by the monitoring infrastructure
  20. Monitoring Kafka
     • Kafka is great for monitoring your applications
  21. KMon and EnlightIN
     • We developed a separate monitoring and notification system
       – Metrics are only retained long enough to alert on them
       – One rule: we can't use Kafka
     • Alerting is simpler than in our self-service system
       – Nothing complex like regular expressions or RPNs
       – Only used for critical Kafka and Zookeeper alerts
       – Faster and more reliable
     • Notifications are cleaner
       – Alerts are grouped into incidents, for fewer notifications when things break
       – The notification system is generic and subscribable, so we can use it for other things
  22. Broker Monitoring
     • Bytes in and out, messages in (see the JMX sketch below)
       – Why not messages out?
     • Partitions
       – Count and leader count
       – Under-replicated and offline
     • Threads
       – Network pool, request pool
       – Max dirty percent
     • Requests
       – Rates and times: total, queue, local, and send
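All of these are exposed over JMX on the broker. As a minimal illustration, here is how the bytes-in rate could be read remotely; the MBean name is Kafka's standard one, while the host and port are placeholders:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerBytesIn {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; the broker must be started with JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Standard Kafka broker metric: rate of incoming bytes.
            ObjectName bytesIn = new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");
            Object rate = conn.getAttribute(bytesIn, "OneMinuteRate");
            System.out.println("BytesInPerSec (1m rate): " + rate);
        }
    }
}
```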
  23. Is Kafka Working?
     • Knowing that the cluster is up isn't always enough
       – Network problems
       – Metrics can lie
     • Customers still ask us first when something breaks
       – Part of the solution is educating them on what to monitor
       – We need to be absolutely sure before answering "There's nothing wrong with Kafka"
  24. Kafka Monitoring Framework
     • Producer-to-consumer testing of a Kafka cluster (see the sketch below)
       – Assures that producers and consumers actually work
       – Measures how long messages take to get through
     • We have an SLO of 99.99% availability for all clusters
     • Working on multi-tier support
       – Answers the question of how long messages take to get to Hadoop
     • LinkedIn Kafka open source: https://github.com/linkedin/streaming
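The framework's core loop is simple to sketch: produce a message stamped with the current time to a canary topic, consume it back, and report the difference. This is an illustrative sketch rather than LinkedIn's monitoring code, and the topic name and bootstrap address are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class EndToEndProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("group.id", "e2e-probe");
        props.put("auto.offset.reset", "latest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(Collections.singletonList("monitoring-canary"));
            consumer.poll(Duration.ofSeconds(1)); // join the group before producing

            // Produce a message carrying its own send timestamp.
            producer.send(new ProducerRecord<>("monitoring-canary",
                Long.toString(System.currentTimeMillis())));
            producer.flush();

            // Consume it back and report end-to-end latency.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> record : records) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(record.value());
                System.out.println("End-to-end latency: " + latencyMs + " ms");
            }
        }
    }
}
```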
  25. Is Mirroring Working?
     • Most critical data flows through Kafka
       – Most of that depends on mirror makers
       – How do we make sure it all gets where it's going?
     • Mirror maker pipelines can have over a thousand topics
       – Different message rates
       – Some are more important than others
     • Lag-threshold monitoring doesn't work
       – Traffic spikes cause false alerts
       – What should the threshold be?
       – There's no easy way to monitor 1,000 topics and over 10,000 partitions
  26. Kafka Audit
     • Audit tracks topic completeness across all clusters in the pipeline (see the sketch below)
       – Primarily for tracking messages
       – The schema must have a valid header
       – Alerts for DWH topics are set at 0.1% message loss
     • Provided as an integrated part of the internal Kafka libraries
     • Used for data completeness checks before Hadoop jobs run
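A rough illustration of the counting half of audit (not LinkedIn's implementation): each tier counts messages per topic per time bucket, keyed off the header timestamp, and buckets are compared between tiers to detect loss. The bucket size below is an assumption:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch of audit-style counting: messages are bucketed by
// topic and by a 10-minute window of their header timestamp. Each tier
// publishes these counts, and downstream counts are compared to upstream.
public class AuditCounter {
    private static final long BUCKET_MS = 10 * 60 * 1000L; // assumed bucket size

    private final Map<String, Map<Long, LongAdder>> counts = new ConcurrentHashMap<>();

    public void record(String topic, long headerTimestampMs) {
        long bucket = headerTimestampMs - (headerTimestampMs % BUCKET_MS);
        counts.computeIfAbsent(topic, t -> new ConcurrentHashMap<>())
              .computeIfAbsent(bucket, b -> new LongAdder())
              .increment();
    }

    // Fraction of upstream messages seen downstream for one topic/bucket;
    // alerting would fire when this drops below 99.9% for DWH topics.
    public static double completeness(long upstreamCount, long downstreamCount) {
        return upstreamCount == 0 ? 1.0 : (double) downstreamCount / upstreamCount;
    }
}
```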
  27. Auditing Message Flows
  28. Burrow
     • Burrow is an advanced Kafka consumer monitoring system
       – Provides an objective view of consumer status (see the sketch below)
       – Much more powerful than threshold-based lag monitoring
     • Burrow is open source!
       – Used by many other companies, including Wikimedia and Blizzard
       – Used internally to assure all mirror makers and Audit are running correctly
     • Exports metrics for all consumers to self-service monitoring
     • https://github.com/linkedin/Burrow
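Burrow's consumer status is typically consumed over its HTTP API. As a hedged sketch (the host, cluster, and group names are placeholders, and the endpoint path varies by Burrow version), a check might look like this:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BurrowStatusCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder Burrow host and cluster/group names; the path follows
        // Burrow's HTTP API, which differs slightly between versions.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://burrow.example.com:8000"
                + "/v3/kafka/prod-cluster/consumer/mirrormaker-group/status"))
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON body includes an overall status (e.g. OK/WARN/ERR) computed
        // from Burrow's sliding-window evaluation of the group's offsets.
        System.out.println(response.body());
    }
}
```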
  29. MTTF Is Not Your Friend
     • We have over 1,800 Kafka brokers
       – All have at least 12 drives; most have 16
       – Dual CPUs and at least 64 GB of memory
       – Really lousy MegaRAID controllers
     • This means hardware fails daily
       – We don't always know when it happens, if it doesn't take the system down
       – It can't always be fixed immediately
       – We can take one broker down, but not two
  30. Moving Partitions
     • Prior to Kafka 0.8, moving partitions was basically impossible
       – It's still not easy: you have to be explicit about what you are moving
       – There's no good way to balance partitions in a cluster
     • We developed kafka-assigner to solve the problem
       – A single command to remove a broker and redistribute its partitions
       – Chainable modules for balancing partitions
       – Open source! https://github.com/linkedin/kafka-tools
     • Also working on "Cruise Control" for Kafka
       – An add-on service that will handle redistributing partitions automatically
  31. Pushing Data from Hadoop
     • To help Hadoop jobs, we maintain a KafkaPushJob
       – A mapper that produces messages to Kafka
       – Pushes to data-deployment, which then gets mirrored to production
     • Hadoop jobs tend to push a lot of data all at once
       – Some jobs spin up hundreds of mappers
       – Pushing many gigabytes of data in a very short period of time
     • This overwhelms a Kafka cluster
       – Spurious alerts for under-replicated partitions
       – Problems with mirroring the messages out
  32. Kafka Quotas
     • Quotas limit traffic based on client ID
       – Specified in bytes/sec, on a per-broker basis
       – Not per-topic or per-partition
     • Should be transparent to clients
       – Accomplished by delaying the response to requests
       – Newer clients expose quota-specific metrics for clarity
     • We use quotas to protect the replication of the cluster (see the sketch below)
       – Set as high as possible while still protecting against a single bad client
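When this deck was written, quotas were managed with the ZooKeeper-backed config tooling; modern Kafka exposes the same setting through the AdminClient. A hedged sketch using the current API, with a made-up client ID and rate:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

import java.util.Collections;
import java.util.Properties;

public class SetClientQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Quota entity keyed by client ID, as described on the slide.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "hadoop-push-job"));
            // Cap this client at 10 MB/s of produce traffic per broker.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity,
                Collections.singletonList(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 10_485_760.0)));
            admin.alterClientQuotas(Collections.singletonList(alteration)).all().get();
        }
    }
}
```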
  33. Delete Topic
     • The feature has been under development for almost 3 years
       – Only recently has it even worked a little bit
       – We're still not sure about it (from SRE's point of view)
     • Recently performed additional testing so we can use it
       – Found that even when it was disabled for a cluster, something was happening
       – Some brokers claimed the topic was gone; some didn't
       – Mirror makers broke for the topic
     • One of the code paths in the controller was not blocked
       – The metadata change went out, but it was hard to diagnose
  34. Brokers Are Independent
     • When there's a problem in the cluster, brokers might have bad information
       – The controller should tell them what the topic metadata is
       – Brokers get out of sync due to connection issues or bugs
     • There's no good tool for simply sending a request to a broker and reading the response (see the sketch below)
       – We had to write a Java application just to send a metadata request
     • Coming soon: kafka-protocol
       – A simple CLI tool for sending individual requests to Kafka brokers
       – Will be part of the https://github.com/linkedin/kafka-tools repository
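One approximation, sketched below, is to point an AdminClient at exactly one broker: the bootstrap metadata then comes from that broker, so comparing the answers across brokers can surface ones that are out of sync. This is an illustration, not the kafka-protocol tool; the AdminClient may route follow-up requests to other brokers it learns about, and all host names here are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

public class PerBrokerMetadata {
    // Fetch one topic's metadata as bootstrapped from a single broker. Running
    // this against each broker in turn can surface brokers that are out of sync.
    static TopicDescription describeVia(String broker, String topic) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", broker); // exactly one broker
        try (Admin admin = Admin.create(props)) {
            return admin.describeTopics(Collections.singletonList(topic))
                        .all().get().get(topic);
        }
    }

    public static void main(String[] args) throws Exception {
        for (String broker : new String[] {
                "broker1.example.com:9092", "broker2.example.com:9092"}) {
            System.out.println(broker + " -> " + describeVia(broker, "test-topic"));
        }
    }
}
```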
  35. Conclusion
  36. Broker Improvement – JBOD
     • We use RAID-10 on all brokers
       – Trades a lot of performance for a little resiliency
       – We lose half of our disk space
     • The current JBOD implementation isn't great
       – No admin tools for moving partitions
       – Assignment is round-robin
       – The broker shuts down if a single disk fails
     • Looking at options
       – Might try to fix the JBOD implementation in Kafka
       – Testing running multiple brokers on a single server
  37. Mirror Maker Improvements
     • Mirror maker has performance issues
       – It has to decompress and recompress every message
       – It loses information about partition affinity and strict ordering
     • Developed an identity message handler (see the sketch below)
       – Messages in source partition 0 get produced directly to partition 0
       – Requires mirror maker to maintain downstream partition counts
     • Working on the next steps
       – No decompression of message batches
       – Looking at other options for how to run mirror makers
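The core idea of the identity handler is easy to sketch with the plain consumer and producer APIs. This is an illustration, not LinkedIn's mirror maker code: it carries the source partition number through to the produce call, and it assumes the destination topic has at least as many partitions as the source (topic and host names are placeholders).

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class IdentityMirror {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source.example.com:9092"); // placeholder
        consumerProps.put("group.id", "identity-mirror");
        consumerProps.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "target.example.com:9092"); // placeholder
        producerProps.put("key.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("tracking-topic"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Identity mapping: produce to the same partition number,
                    // preserving per-partition ordering and key affinity.
                    producer.send(new ProducerRecord<>(record.topic(),
                        record.partition(), record.key(), record.value()));
                }
            }
        }
    }
}
```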
  38. Administrative Improvements
     • Multiple-cluster management
       – Topic management across clusters
       – Visualization of mirror maker paths
     • Better client monitoring
       – Burrow for consumer monitoring
       – No open source solution for producer monitoring (audit)
     • End-to-end availability monitoring
  39. Getting Involved With Kafka
     • http://kafka.apache.org
     • Join the mailing lists:
       – users@kafka.apache.org
       – dev@kafka.apache.org
     • irc.freenode.net – #apache-kafka
     • Meetups (Bay Area):
       – https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/
     • Contribute code
