SITE RELIABILITY ENGINEERING©2014 LinkedIn Corporation. All Rights Reserved.
Enterprise Kafka
Why Am I Here?
• You want to find out what this “Kafka” thing is
• You’re running Kafka, but you want to go big
• You’re looking for some neat whizbangs
Clark Haskins
Todd Palino
Who Are We?
• Kafka SRE at LinkedIn
• Site Reliability Engineering
– Administrators
– Architects
– Developers
• Keep the site running, always
Kafka Overview
What Is Kafka?
[Diagram: a Producer, a Consumer, Zookeeper, and a Broker holding partitions P0 and P1 of topic A]
Attributes of a Kafka Cluster
• Disk Based
• Durable
• Scalable
• Low Latency
• Finite Retention
• NOT Idempotent (yet)
Kafka At LinkedIn
• Multiple Datacenters, Multiple Clusters
• Mirroring between clusters
• Message Types
– Metrics
– Tracking
– Queuing
• Data transport from applications to Hadoop, and back
Kafka At LinkedIn
 300+ Kafka brokers
 Over 18,000 topics
 140,000+ Partitions
 220 Billion messages per day
 40 Terabytes In
 160 Terabytes Out
 Peak Load
– 3.25 Million messages per second
– 5.5 Gigabits/sec Inbound
– 18 Gigabits/sec Outbound
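A rough cross-check of these figures, as a minimal sketch: it assumes decimal units (1 terabyte = 10^12 bytes) and spreads the daily totals evenly over 86,400 seconds. The variable names are illustrative and not from the talk.

```python
# Daily totals from the slide; decimal units are an assumption.
MESSAGES_PER_DAY = 220e9
BYTES_IN_PER_DAY = 40e12
BYTES_OUT_PER_DAY = 160e12
SECONDS_PER_DAY = 86_400

# Average rates, for comparison with the quoted peaks.
avg_messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
avg_gbits_in_per_sec = BYTES_IN_PER_DAY * 8 / SECONDS_PER_DAY / 1e9
avg_gbits_out_per_sec = BYTES_OUT_PER_DAY * 8 / SECONDS_PER_DAY / 1e9

# Derived ratios: mean message size and how many times each byte is read.
avg_message_bytes = BYTES_IN_PER_DAY / MESSAGES_PER_DAY
read_fanout = BYTES_OUT_PER_DAY / BYTES_IN_PER_DAY

print(f"{avg_messages_per_sec / 1e6:.2f} million messages/sec on average")
print(f"{avg_gbits_in_per_sec:.1f} Gbit/s in, {avg_gbits_out_per_sec:.1f} Gbit/s out on average")
print(f"{avg_message_bytes:.0f} bytes per message, read fan-out {read_fanout:.0f}x")
```

Under these assumptions the averages come out at roughly 2.5 million messages per second, 3.7 Gbit/s in, and 14.8 Gbit/s out, so the quoted peaks sit modestly above the daily mean, and each byte written is read about four times.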
Challenges We Have Overcome
Solutions
• Kafka is young, so we influenced its development
• Operations wizardry
Hyper Growth
• Need to expand clusters to keep up with site traffic, and then balance them.
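The balancing step can be illustrated with a small sketch. This is not the tooling described in the talk: `level_brokers` is a hypothetical helper that evens out partition counts only, and it ignores replica placement, partition size, and leadership, all of which a real tool would have to consider.

```python
from collections import Counter

def level_brokers(assignment, brokers):
    """Return a new {partition: broker} map whose per-broker counts differ by at most one.

    assignment maps a partition key to the broker that currently holds it;
    brokers lists every broker that may hold partitions, including new ones.
    """
    new_assignment = dict(assignment)
    counts = Counter({broker: 0 for broker in brokers})
    counts.update(new_assignment.values())
    while True:
        most = max(brokers, key=lambda b: counts[b])
        least = min(brokers, key=lambda b: counts[b])
        if counts[most] - counts[least] <= 1:
            return new_assignment
        # Move one partition from the fullest broker to the emptiest one.
        partition = next(p for p in sorted(new_assignment) if new_assignment[p] == most)
        new_assignment[partition] = least
        counts[most] -= 1
        counts[least] += 1

# Hypothetical cluster: topics A and B, four partitions each, on brokers 1 and 2.
before = {(topic, p): 1 + p % 2 for topic in "AB" for p in range(4)}
after = level_brokers(before, brokers=[1, 2, 3])
print(Counter(after.values()))
```

Adding broker 3 to the list is enough for the loop to start moving partitions onto it; without a leveling step, a new broker would only receive partitions of topics created after it joined.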
Adding brokers
[Diagram: Brokers, Consumers, and Producers. Partition boxes are labeled by topic and partition number: topics A and B with partitions P0–P7 and topic C with partitions P0–P3, each box appearing twice.]
Adding a broker (with broker leveling)
[Diagram: Brokers, Consumers, and Producers. Partition boxes are labeled by topic and partition number: topics A and B with partitions P0–P7 and topic C with partitions P0–P3, each box appearing twice.]
Logs vs. Metrics
• Logging data killed the metrics cluster
Quality of Service with Kafka
[Diagram: Brokers, Consumers, and Producers. Partition boxes are labeled by topic and partition number: topics A and B with partitions P0–P7 and topic C with partitions P0–P3, each box appearing twice.]
Deployment Nightmares
• Parallel deployment wasn’t possible, so…
• Babysitting sequential deployments
Easy deployments
• Kafka 0.8.1 makes sure the cluster is in a good state before a broker shuts down
– If any broker in the cluster has under-replicated partitions, Kafka will not shut down
– Kafka ensures that only one broker is in its shutdown sequence at a time
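The two rules above can be modeled as an external driver loop. This is a sketch of the idea only, not Kafka's own shutdown code: `under_replicated_count` and `restart` are hypothetical callables standing in for whatever monitoring and deployment tooling is in use, and `restart` is assumed to return once the broker is back up.

```python
import time

def rolling_restart(brokers, under_replicated_count, restart,
                    poll_seconds=30, sleep=time.sleep):
    """Restart brokers one at a time, only while the cluster is fully replicated."""
    restarted = []
    for broker in brokers:
        # Rule 1: never take a broker down while any partition is under-replicated.
        while under_replicated_count() > 0:
            sleep(poll_seconds)
        # Rule 2: the loop is sequential, so only one broker is down at a time.
        restart(broker)
        restarted.append(broker)
    return restarted
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in normal use the default `time.sleep` applies.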
Killing Zookeeper
• Consumer offset management is done within Zookeeper
• Every consumer committing offsets every minute for every partition makes ZK very unhappy.
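The load this creates is easy to estimate. A minimal sketch, assuming each commit is one Zookeeper write per consumer group per partition; the workload numbers are hypothetical and not from the talk.

```python
def offset_commits_per_second(consumer_groups, partitions_per_group,
                              commit_interval_seconds=60):
    """Zookeeper writes per second caused by periodic offset commits."""
    return consumer_groups * partitions_per_group / commit_interval_seconds

# Hypothetical workload: 10 consumer groups, each reading 6,000 partitions.
print(offset_commits_per_second(10, 6_000))  # 1000.0
```

Every one of those writes has to be logged to disk by the ensemble, which is why the write rate grows painful as partitions and consumers multiply.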
Zookeeper on SSD
Monitoring
Kafka Is Broken!
• Everything is Kafka’s fault first
• What is lag?
• Consumer Problems
– Application problems
– Kafka client problems
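The slide asks the question without stating an answer. Using the common definition, lag is the distance between the newest offset in a partition and the offset the consumer has committed. A minimal sketch with hypothetical helper names:

```python
def partition_lag(log_end_offset, committed_offset):
    """Messages written to a partition that the consumer has not yet committed."""
    return max(log_end_offset - committed_offset, 0)

def total_lag(log_end_offsets, committed_offsets):
    """Sum lag over partitions; both arguments map partition -> offset.

    A partition with no committed offset is counted from offset 0, which is
    an assumption of this sketch.
    """
    return sum(
        partition_lag(end, committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    )

# Partition 0 is 10 messages behind; partition 1 is fully caught up.
print(total_lag({0: 100, 1: 250}, {0: 90, 1: 250}))  # 10
```

A lag that keeps growing while the brokers are healthy points at the consumer side, which is the distinction the slide draws between application problems and Kafka client problems.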
How Do We Sleep At Night?
• Educating Users
– Why lag is their fault
• Monitoring the Ecosystem
– Kafka Brokers
– Zookeeper
– Mirror Makers
– Audit
– REST Interfaces
• Week Over Week
Cluster Health and Utilization
• Under-replicated partitions
• Offline partitions
• Broker partition count
• Data size on disk
• Leader partition count
• Network utilization
Zookeeper
• Ensemble availability
• Latency
• Outstanding requests
Mirror Maker and Audit
• Mirror Maker
– Lag
– Dropped Messages
• Audit Consumer
– Lag
– Completeness check
• Audit UI
[Diagram: a Producer sends Messages and Message Counts to a Cluster; a Mirror Maker (MM) links it to a second Cluster; an Audit Consumer on each cluster reads All Messages and records Audit State, which feeds the Audit UI]
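The completeness check can be pictured as comparing, per time bucket, the count the producers reported with the count an audit consumer actually saw. The talk does not describe the implementation; the function and bucket labels below are hypothetical.

```python
def completeness(produced_counts, consumed_counts):
    """Return {bucket: fraction of produced messages seen by the audit consumer}.

    Both arguments map a time-bucket label to a message count. A bucket with
    nothing produced is reported as complete.
    """
    result = {}
    for bucket, produced in produced_counts.items():
        consumed = consumed_counts.get(bucket, 0)
        result[bucket] = consumed / produced if produced else 1.0
    return result

# Hypothetical ten-minute buckets: the second one lost a quarter of its messages.
print(completeness({"10:00": 100, "10:10": 200}, {"10:00": 100, "10:10": 150}))
```

Running the same comparison against each cluster in the mirroring chain would show where in the pipeline messages went missing.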
Audit UI
Tuning
Hardware and OS
• Kernel Tuning
– Swapping is Death
– Allow more dirty pages
– Allow less dirty cache
• Disk throughput
– More spindles
– Longer commit interval
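The slide names the goals but not the values. One plausible reading maps the three kernel bullets onto the Linux sysctls `vm.swappiness`, `vm.dirty_ratio`, and `vm.dirty_background_ratio`. The sketch below renders such settings in sysctl.conf form; the mapping is an interpretation and the numbers are placeholders chosen for illustration, not values from the talk.

```python
# Placeholder values; tune and measure on your own hardware.
SYSCTL_SETTINGS = {
    "vm.swappiness": 1,               # avoid swapping almost entirely
    "vm.dirty_ratio": 60,             # let more dirty pages build up before writers block
    "vm.dirty_background_ratio": 5,   # start background flushing early
}

def render_sysctl_conf(settings):
    """Render a {key: value} map as sorted sysctl.conf lines."""
    return "\n".join(f"{key} = {value}" for key, value in sorted(settings.items()))

print(render_sysctl_conf(SYSCTL_SETTINGS))
```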
Java Virtual Machine
Garbage Collection
• Java 7, update 51
• Garbage First (G1) Collector
– Set the heap size
– Specify a target GC pause time
– Don’t set the New size
• GC Times
– Less than 15ms per second in GC
– Steady 20-22ms GC intervals
– Almost no full GC cycles (and only 200-400ms when they do occur)
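The three G1 bullets translate into a short list of standard HotSpot flags. A minimal sketch: the helper name is hypothetical, and the 4 GB heap and 20 ms pause target are example inputs, not settings quoted in the talk.

```python
def g1_jvm_flags(heap_gb, pause_target_ms):
    """Build JVM flags for G1 with a fixed heap and a pause-time goal.

    No young-generation size (-Xmn, -XX:NewSize) is set, so G1 stays free to
    resize the young generation to meet the pause target.
    """
    return [
        f"-Xms{heap_gb}g",
        f"-Xmx{heap_gb}g",
        "-XX:+UseG1GC",
        f"-XX:MaxGCPauseMillis={pause_target_ms}",
    ]

print(" ".join(g1_jvm_flags(heap_gb=4, pause_target_ms=20)))
```

Setting `-Xms` equal to `-Xmx` fixes the heap size up front; the pause target is a goal that G1 tries to honor, not a hard limit.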
Closing
What’s Coming in 0.8.2
• Consumer offsets in the broker
• Delete topic
• Further down the road
– New producer
– Improved producer API
Upcoming Operational Work
• Learning to share
• Shrinking a cluster
• Cluster comparison
• Advanced monitoring
How Can You Get Involved?
• http://kafka.apache.org
• Join the mailing lists
– users@kafka.apache.org
• irc.freenode.net - #apache-kafka
• Contribute tools
Talk To Us
• Kafka SREs at LinkedIn
– Clark Haskins
  • https://www.linkedin.com/in/clarkhaskins
  • chaskins@linkedin.com
– Todd Palino
  • https://www.linkedin.com/in/toddpalino
  • tpalino@linkedin.com
Questions
Enterprise Kafka: Kafka as a Service
