Enterprise Kafka: Kafka as a Service

Kafka is a publish/subscribe messaging system that, while young, forms a vital core for data flow inside many organizations, including LinkedIn. We will discuss Kafka from an Operations point of view, including the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.

NOTE: I highly recommend viewing the original PPT. It has copious speaker notes for each slide, and the animations will actually work properly.

Published in: Data & Analytics

Transcript

  • 1. SITE RELIABILITY ENGINEERING ©2014 LinkedIn Corporation. All Rights Reserved. Enterprise Kafka
  • 2. Why Am I Here?
      • You want to find out what this “Kafka” thing is
      • You’re running Kafka, but you want to go big
      • You’re looking for some neat whizbangs
  • 3. Clark Haskins, Todd Palino
  • 4. Who Are We?
      • Kafka SRE at LinkedIn
      • Site Reliability Engineering
          – Administrators
          – Architects
          – Developers
      • Keep the site running, always
  • 5. Kafka Overview
  • 6. What Is Kafka?
  • 7. What Is Kafka? [Diagram: Producer, Broker (topic A, partitions P0 and P1), Consumer, Zookeeper]
  • 8. Attributes of a Kafka Cluster
      • Disk Based
      • Durable
      • Scalable
      • Low Latency
      • Finite Retention
      • NOT Idempotent (yet)
  • 9. Kafka At LinkedIn
      • Multiple Datacenters, Multiple Clusters
      • Mirroring between clusters
      • Message Types
          – Metrics
          – Tracking
          – Queuing
      • Data transport from applications to Hadoop, and back
  • 10. Kafka At LinkedIn
  • 11. Kafka At LinkedIn
      • 300+ Kafka brokers
      • Over 18,000 topics
      • 140,000+ Partitions
      • 220 Billion messages per day
      • 40 Terabytes In
      • 160 Terabytes Out
      • Peak Load
          – 3.25 Million messages per second
          – 5.5 Gigabits/sec Inbound
          – 18 Gigabits/sec Outbound
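The daily totals on this slide imply a few averages that are worth working out. The sketch below is only arithmetic on the quoted figures; it assumes decimal terabytes (10^12 bytes) and a uniform 86,400-second day, neither of which the slide states.

```python
# Figures quoted on the slide.
MESSAGES_PER_DAY = 220e9
BYTES_IN_PER_DAY = 40e12     # assumption: decimal terabytes
BYTES_OUT_PER_DAY = 160e12   # assumption: decimal terabytes
PEAK_MESSAGES_PER_SEC = 3.25e6

SECONDS_PER_DAY = 24 * 60 * 60

# Average inbound message rate over a whole day.
avg_messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
# Average size of one inbound message.
avg_message_bytes = BYTES_IN_PER_DAY / MESSAGES_PER_DAY
# How many times each inbound byte is read back out, on average.
fan_out = BYTES_OUT_PER_DAY / BYTES_IN_PER_DAY
# How far the quoted peak sits above the daily average.
peak_to_average = PEAK_MESSAGES_PER_SEC / avg_messages_per_sec

print(f"average rate:    {avg_messages_per_sec:,.0f} messages/sec")
print(f"average message: {avg_message_bytes:.0f} bytes")
print(f"read fan-out:    {fan_out:.1f}x")
print(f"peak / average:  {peak_to_average:.2f}")
```

Under those assumptions the cluster averages roughly 2.5 million messages per second at about 180 bytes each, and every byte written is read about four times.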
  • 12. Challenges We Have Overcome
  • 13. Solutions
      • Kafka is young... we influenced development
      • Operations wizardry...
  • 14. Hyper Growth
      • Need to expand clusters to keep up with site traffic, and then balance them.
  • 15. Adding brokers [Diagram: Producers, Brokers, Consumers; partitions P0-P7 of topics A and B and P0-P3 of topic C spread across the brokers]
  • 16. Adding a broker (with broker leveling) [Diagram: Producers, Brokers, Consumers; partitions P0-P7 of topics A and B and P0-P3 of topic C spread across the brokers]
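The idea of leveling a cluster after adding a broker can be sketched as a greedy loop that moves partitions from the fullest broker to the emptiest one. This is an illustration only, not the reassignment tooling the speakers used: the function name and data layout are hypothetical, and it balances partition counts while ignoring replicas, leaders, and partition size.

```python
def level_brokers(assignment):
    """Balance partition counts so brokers differ by at most one partition.

    assignment maps a broker id to the list of partitions it holds and must
    contain at least one broker. Returns (new_assignment, moves), where each
    move is (partition, from_broker, to_broker).
    """
    result = {broker: list(parts) for broker, parts in assignment.items()}
    moves = []
    while True:
        busiest = max(result, key=lambda b: len(result[b]))
        idlest = min(result, key=lambda b: len(result[b]))
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result, moves
        partition = result[busiest].pop()
        result[idlest].append(partition)
        moves.append((partition, busiest, idlest))


# Two loaded brokers plus one newly added, empty broker (hypothetical layout).
before = {
    1: [f"A-P{i}" for i in range(8)],
    2: [f"B-P{i}" for i in range(8)],
    3: [],
}
after, moves = level_brokers(before)
print({broker: len(parts) for broker, parts in after.items()})
print(f"{len(moves)} partitions moved")
```

Every move copies a partition's data between brokers, so a real tool would also want to minimise the number and size of moves rather than only evening out the counts.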
  • 17. Logs vs. Metrics
      • Logging data killed the metrics cluster
  • 18. Quality of Service with Kafka [Diagram: Producers, Brokers, Consumers; partitions P0-P7 of topics A and B and P0-P3 of topic C spread across the brokers]
  • 19. Deployment Nightmares
      • Parallel deployment wasn’t possible so…
      • Babysitting sequential deployments
  • 20. Easy deployments
      • Kafka 0.8.1 makes sure the cluster is in a good state before shutting down
          – If any brokers in the cluster have under replicated partitions, Kafka will not shut down
          – Kafka ensures that only 1 broker is in shutdown sequence at a time.
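The two shutdown conditions on this slide can be written as a small guard. This is a sketch of the rule as stated, not Kafka's implementation; the function name and both inputs are hypothetical stand-ins for state that a real broker would obtain from the cluster.

```python
def may_shut_down(broker_id, under_replicated_by_broker, brokers_shutting_down):
    """Apply the slide's two conditions to one shutdown request.

    under_replicated_by_broker maps broker id -> count of under replicated
    partitions; brokers_shutting_down is the set of broker ids that are
    currently in their shutdown sequence.
    """
    # Condition 1: no broker anywhere in the cluster is under replicated.
    if any(count > 0 for count in under_replicated_by_broker.values()):
        return False
    # Condition 2: no other broker is already shutting down.
    if brokers_shutting_down - {broker_id}:
        return False
    return True


healthy = {1: 0, 2: 0, 3: 0}
print(may_shut_down(1, healthy, set()))             # healthy and idle cluster
print(may_shut_down(1, {1: 0, 2: 4, 3: 0}, set()))  # broker 2 is under replicated
print(may_shut_down(1, healthy, {2}))               # broker 2 is mid-shutdown
```

With both checks in place, a deployment script can request a restart on every broker and let the guard serialise them, which is what removes the need to babysit each one.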
  • 21. Killing Zookeeper
      • Consumer offset management done within Zookeeper
      • Every consumer committing offsets every minute for every partition makes ZK very unhappy.
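A rough estimate shows why per-partition offset commits add up. The sketch below assumes one Zookeeper write per consumer group per partition per commit interval; the group and partition counts in the example are made up for illustration and do not come from the talk.

```python
def offset_writes_per_second(consumer_groups, partitions_per_group, commit_interval_s=60):
    """Estimate Zookeeper writes/sec caused by consumer offset commits.

    Assumes every group commits one offset per partition once per interval.
    """
    return consumer_groups * partitions_per_group / commit_interval_s


# Hypothetical load: 50 consumer groups, each reading 10,000 partitions.
rate = offset_writes_per_second(consumer_groups=50, partitions_per_group=10_000)
print(f"{rate:,.0f} Zookeeper writes per second")
```

Zookeeper funnels every write through its leader and syncs it to the transaction log before acknowledging it, so a sustained write rate like this lands directly on disk latency.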
  • 22. Zookeeper on SSD
  • 23. Monitoring
  • 24. Kafka Is Broken!
  • 25. Kafka Is Broken!
      • Everything is Kafka’s fault first
      • What is lag?
      • Consumer Problems
          – Application problems
          – Kafka client problems
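The slide asks what lag is without answering; the usual definition is the distance between the newest offset in a partition and the offset the consumer has committed. A minimal sketch under that definition follows. The function name and dictionaries are hypothetical, and treating a missing commit as offset 0 is a simplification.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed as consumed.

    Both arguments map a partition name to an offset. A partition with no
    committed offset is treated as committed at 0 (a simplification).
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


lag = consumer_lag(
    log_end_offsets={"A-0": 100, "A-1": 50},
    committed_offsets={"A-0": 90},
)
print(lag)
print("total lag:", sum(lag.values()))
```

Lag defined this way grows whenever the consumer processes messages more slowly than producers write them, which is why it more often points at the consuming application than at the brokers.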
  • 26. How Do We Sleep At Night?
      • Educating Users
          – Why lag is their fault
      • Monitoring the Ecosystem
          – Kafka Brokers
          – Zookeeper
          – Mirror Makers
          – Audit
          – REST Interfaces
      • Week Over Week
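Week-over-week monitoring compares a metric with its value at the same time one week earlier instead of with a fixed threshold. A minimal sketch follows; the function names and the 25% tolerance are assumptions chosen for illustration, not values from the talk.

```python
def week_over_week_change(current, week_ago):
    """Fractional change versus the same time last week, or None without a baseline."""
    if week_ago == 0:
        return None
    return (current - week_ago) / week_ago


def is_anomalous(current, week_ago, tolerance=0.25):
    """Flag a metric that moved more than `tolerance` from last week's value."""
    change = week_over_week_change(current, week_ago)
    return change is not None and abs(change) > tolerance


# Hypothetical messages-per-second readings: now versus seven days ago.
print(week_over_week_change(120, 100))
print(is_anomalous(150, 100))
print(is_anomalous(110, 100))
```

Comparing against last week absorbs the normal daily and weekly traffic cycle, so an alert only fires when the shape of the traffic itself changes.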
  • 27. Cluster Health and Utilization
      • Under replicated partitions
      • Offline partitions
      • Broker partition count
      • Data size on disk
      • Leader partition count
      • Network utilization
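Several of these metrics can be derived from per-partition state. The sketch below assumes a hypothetical record format (a dict with `leader`, `replicas`, and `isr` keys) rather than any real Kafka API; it treats a partition as under replicated when its in-sync replica set is smaller than its replica list, and as offline when it has no leader.

```python
from collections import Counter


def cluster_health(partitions):
    """Summarise partition state records into the count-based slide metrics."""
    return {
        # Fewer in-sync replicas than assigned replicas.
        "under_replicated": sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"])),
        # No leader means the partition cannot serve producers or consumers.
        "offline": sum(1 for p in partitions if p["leader"] is None),
        # Replicas hosted per broker.
        "partition_count": Counter(b for p in partitions for b in p["replicas"]),
        # Partitions led per broker.
        "leader_count": Counter(p["leader"] for p in partitions if p["leader"] is not None),
    }


# Hypothetical three-partition topic spread over brokers 1-3.
state = [
    {"leader": 1, "replicas": [1, 2], "isr": [1, 2]},
    {"leader": 2, "replicas": [2, 3], "isr": [2]},
    {"leader": None, "replicas": [3, 1], "isr": []},
]
health = cluster_health(state)
print(health)
```

Data size on disk and network utilization are host-level measurements and would come from the operating system rather than from partition metadata.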
  • 28. Zookeeper
      • Ensemble availability
      • Latency
      • Outstanding requests
  • 29. Mirror Maker and Audit
      • Mirror Maker
          – Lag
          – Dropped Messages
      • Audit Consumer
          – Lag
          – Completeness check
      • Audit UI
      [Diagram labels: Producer, Cluster, MM, Cluster, Messages, Message Counts, Audit Consumer, All Messages, Audit State, Audit UI]
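A completeness check of the kind listed here can be sketched as a comparison between the message counts producers report and the counts an audit consumer actually sees. The keying by (topic, time bucket) and the function name are assumptions for illustration; the slide does not describe the audit format.

```python
def completeness(produced_counts, consumed_counts):
    """Fraction of produced messages seen by the audit consumer, per key.

    Both arguments map a (topic, time_bucket) key to a message count. A key
    with nothing produced is reported as fully complete.
    """
    report = {}
    for key, produced in produced_counts.items():
        consumed = consumed_counts.get(key, 0)
        report[key] = consumed / produced if produced else 1.0
    return report


# Hypothetical counts for one topic over two ten-minute buckets.
produced = {("tracking", "12:00"): 1000, ("tracking", "12:10"): 800}
consumed = {("tracking", "12:00"): 1000, ("tracking", "12:10"): 600}
report = completeness(produced, consumed)
incomplete = [key for key, ratio in report.items() if ratio < 1.0]
print(report)
print("incomplete buckets:", incomplete)
```

Running the same comparison on each cluster in a mirroring chain shows where along the path messages went missing.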
  • 30. Audit UI
  • 31. Audit UI
  • 32. Tuning
  • 33. Hardware and OS
      • Kernel Tuning
          – Swapping is Death
          – Allow more dirty pages
          – Allow less dirty cache
      • Disk throughput
          – More spindles
          – Longer commit interval
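On Linux, the kernel-tuning bullets most plausibly map to the `vm.swappiness`, `vm.dirty_ratio`, and `vm.dirty_background_ratio` sysctls. That mapping is an interpretation, and the values below are illustrative rather than figures from the talk. The sketch parses sysctl-style text and flags settings that contradict the slide's advice.

```python
# Illustrative values only (assumptions, not from the slides).
EXAMPLE_SYSCTL = """
# keep the kernel from swapping broker memory out
vm.swappiness = 1
# start background flushing early
vm.dirty_background_ratio = 5
# but allow many dirty pages before writers are blocked
vm.dirty_ratio = 60
"""


def parse_sysctl(text):
    """Parse 'key = value' lines with integer values, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = int(value.strip())
    return values


def check_vm_settings(values):
    """Return warnings; missing keys fall back to common Linux defaults (assumed)."""
    warnings = []
    if values.get("vm.swappiness", 60) > 1:
        warnings.append("vm.swappiness is high; the slide warns that swapping is death")
    if values.get("vm.dirty_background_ratio", 10) >= values.get("vm.dirty_ratio", 20):
        warnings.append("background flushing should begin well below the hard dirty limit")
    return warnings


settings = parse_sysctl(EXAMPLE_SYSCTL)
print(settings)
print(check_vm_settings(settings))
```

Appropriate numbers depend on memory size and disk throughput, so any real values should be validated against the broker's own flush behaviour.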
  • 34. Java Virtual Machine
  • 35. Garbage Collection
  • 36. Garbage Collection
      • Java 7, update 51
      • Garbage First (G1) Collector
          – Set the heap size
          – Specify a target GC pause time
          – Don’t set the New size
      • GC Times
          – Less than 15ms per second in GC
          – Steady 20-22ms GC intervals
          – Almost no full GC cycles (and only 200-400ms when they do occur)
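The three G1 bullets translate into a short set of standard HotSpot flags. The helper below is a sketch: the 4 GB heap and 20 ms pause target in the example are placeholders, and setting the initial and maximum heap to the same value is an assumption about what "set the heap size" means. It deliberately emits no young-generation sizing flag, following the advice not to set the New size.

```python
def g1_jvm_flags(heap_gb, pause_target_ms):
    """Build JVM flags for G1 with a fixed heap and a pause-time target."""
    return [
        f"-Xms{heap_gb}g",                          # initial heap
        f"-Xmx{heap_gb}g",                          # maximum heap (same value: assumption)
        "-XX:+UseG1GC",                             # select the Garbage First collector
        f"-XX:MaxGCPauseMillis={pause_target_ms}",  # pause-time goal, not a guarantee
    ]


def gc_overhead(ms_in_gc_per_second):
    """Fraction of wall-clock time spent in garbage collection."""
    return ms_in_gc_per_second / 1000


flags = g1_jvm_flags(heap_gb=4, pause_target_ms=20)  # placeholder values
print(" ".join(flags))
print(f"GC overhead at 15 ms/sec: {gc_overhead(15):.1%}")
```

The quoted figure of less than 15 ms per second in GC corresponds to an overhead of under 1.5% of wall-clock time.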
  • 37. Closing
  • 38. What’s Coming in 0.8.2
      • Consumer offsets in the broker
      • Delete topic
      • Further down the road
          – New producer
          – Improved producer API
  • 39. Upcoming Operational Work
      • Learning to share
      • Shrinking a cluster
      • Cluster comparison
      • Advanced monitoring
  • 40. How Can You Get Involved?
      • http://kafka.apache.org
      • Join the mailing lists
          – users@kafka.apache.org
      • irc.freenode.net - #apache-kafka
      • Contribute tools
  • 41. Talk To Us
      • Kafka SREs at LinkedIn
          – Clark Haskins: https://www.linkedin.com/in/clarkhaskins, chaskins@linkedin.com
          – Todd Palino: https://www.linkedin.com/in/toddpalino, tpalino@linkedin.com
  • 42. Questions