SITE RELIABILITY ENGINEERING©2014 LinkedIn Corporation. All Rights Reserved.
Enterprise Kafka
Why Am I Here?
 You want to find out what this “Kafka” thing is
 You’re running Kafka, but you want to go big
 You’re looking for some neat whizbangs
Clark Haskins
Todd Palino
Who Are We?
 Kafka SRE at LinkedIn
 Site Reliability Engineering
– Administrators
– Architects
– Developers
 Keep the site running, always
Kafka Overview
What Is Kafka?
[Diagram: Producer, Consumer, Zookeeper, and a Broker holding partitions labeled A P0 and A P1]
Attributes of a Kafka Cluster
 Disk Based
 Durable
 Scalable
 Low Latency
 Finite Retention
 NOT Idempotent (yet)
Kafka At LinkedIn
 Multiple Datacenters, Multiple Clusters
 Mirroring between clusters
 Message Types
– Metrics
– Tracking
– Queuing
 Data transport from applications to Hadoop, and back
Kafka At LinkedIn
 300+ Kafka brokers
 Over 18,000 topics
 140,000+ Partitions
 220 Billion messages per day
 40 Terabytes In
 160 Terabytes Out
 Peak Load
– 3.25 Million messages per second
– 5.5 Gigabits/sec Inbound
– 18 Gigabits/sec Outbound
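As a rough cross-check of the figures above, the daily totals can be turned into average rates. This is only a sketch: it assumes the 40 TB in and 160 TB out are per-day totals (the slide does not say so explicitly) and uses decimal units.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide; treating the terabyte totals as per-day is an assumption.
messages_per_day = 220e9
tb_in_per_day = 40
tb_out_per_day = 160


def average_gigabits_per_second(terabytes_per_day: float) -> float:
    """Convert a daily volume in decimal terabytes to an average rate in Gbit/s."""
    bits_per_day = terabytes_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
avg_in_gbps = average_gigabits_per_second(tb_in_per_day)
avg_out_gbps = average_gigabits_per_second(tb_out_per_day)
read_fanout = tb_out_per_day / tb_in_per_day  # bytes read per byte written

print(f"average messages/sec: {avg_messages_per_second:,.0f}")
print(f"average inbound:  {avg_in_gbps:.1f} Gbit/s")
print(f"average outbound: {avg_out_gbps:.1f} Gbit/s")
print(f"read fan-out:     {read_fanout:.0f}x")
```

Under that assumption the averages (about 2.5 million messages/sec, 3.7 Gbit/s in, 14.8 Gbit/s out) sit below the quoted peaks, which is what one would expect.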
Challenges We Have Overcome
Solutions
 Kafka is young, so we influenced its development
 Operations wizardry…
Hyper Growth
 Need to expand clusters to keep up with site traffic, and then balance them.
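To make "expand, then balance" concrete, here is a toy sketch of leveling partitions after a broker is added. It is not Kafka's reassignment tool: the `level` function, the broker names, and the greedy one-move-at-a-time strategy are assumptions for illustration, and it ignores replica placement rules and partition sizes.

```python
def level(assignment: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Greedily move partitions from the fullest broker to the emptiest one.

    `assignment` maps a broker name to the partitions it holds and is modified
    in place. Returns the moves as (partition, from_broker, to_broker) tuples.
    """
    moves = []
    while True:
        fullest = max(assignment, key=lambda b: len(assignment[b]))
        emptiest = min(assignment, key=lambda b: len(assignment[b]))
        # Stop once broker counts differ by at most one partition.
        if len(assignment[fullest]) - len(assignment[emptiest]) <= 1:
            return moves
        partition = assignment[fullest].pop()
        assignment[emptiest].append(partition)
        moves.append((partition, fullest, emptiest))


# Hypothetical cluster: two loaded brokers plus a newly added, empty one.
cluster = {
    "broker1": ["A-P0", "A-P1", "B-P0", "B-P1"],
    "broker2": ["A-P2", "A-P3", "B-P2", "B-P3"],
    "broker3": [],
}
moves = level(cluster)
counts = {broker: len(parts) for broker, parts in cluster.items()}
print(moves)
print(counts)
```

Without the leveling step the new broker would stay empty until new partitions were created; with it, two moves bring the three brokers to within one partition of each other.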
Adding brokers
[Diagram: Producers, Brokers, and Consumers. The brokers hold partitions labeled A P0-P7, B P0-P7, and C P0-P3, each appearing twice.]
Adding a broker (with broker leveling)
[Diagram: Producers, Brokers, and Consumers. The brokers hold partitions labeled A P0-P7, B P0-P7, and C P0-P3, each appearing twice.]
Logs vs. Metrics
 Logging data killed the metrics cluster
Quality of Service with Kafka
[Diagram: Producers, Brokers, and Consumers. The brokers hold partitions labeled A P0-P7, B P0-P7, and C P0-P3, each appearing twice.]
Deployment Nightmares
 Parallel deployment wasn’t possible so…
 Babysitting sequential deployments
Easy deployments
 Kafka 0.8.1 makes sure the cluster is in a good state before shutting down
– If any broker in the cluster has under-replicated partitions, Kafka will not shut down
– Kafka ensures that only one broker is in the shutdown sequence at a time.
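The two rules above can be sketched as a gate in front of each shutdown. This is a simulation of the rule as the slide describes it, not Kafka's controlled-shutdown code; `can_shut_down` and its arguments are hypothetical names.

```python
def can_shut_down(under_replicated: dict[str, int], shutting_down: set[str]) -> bool:
    """Return True only if the cluster is healthy and no other broker is stopping.

    `under_replicated` maps each broker to its count of under-replicated
    partitions; `shutting_down` holds the brokers currently in shutdown.
    """
    cluster_healthy = all(count == 0 for count in under_replicated.values())
    nobody_else_stopping = len(shutting_down) == 0
    return cluster_healthy and nobody_else_stopping


# Healthy cluster, nothing in progress: the next broker may stop.
healthy = can_shut_down({"broker1": 0, "broker2": 0}, set())
# One broker still has under-replicated partitions: wait.
degraded = can_shut_down({"broker1": 0, "broker2": 3}, set())
# Another broker is already shutting down: wait.
busy = can_shut_down({"broker1": 0, "broker2": 0}, {"broker1"})
print(healthy, degraded, busy)
```

With a gate like this enforced by the cluster itself, a deployment can ask every broker to restart without someone watching each one in turn.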
Killing Zookeeper
 Consumer offset management done within Zookeeper
 Every consumer committing offsets every minute for every partition makes ZK very unhappy.
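A back-of-the-envelope calculation shows how quickly this adds up. The consumer and partition counts below are made-up figures; only the one-minute commit interval comes from the slide.

```python
def zookeeper_writes_per_second(consumers: int, partitions_per_consumer: int,
                                commit_interval_seconds: float) -> float:
    """Each offset commit is one Zookeeper write per partition per consumer."""
    return consumers * partitions_per_consumer / commit_interval_seconds


# Hypothetical: 100 consumers, each committing offsets for 1,000 partitions
# once a minute.
rate = zookeeper_writes_per_second(consumers=100, partitions_per_consumer=1_000,
                                   commit_interval_seconds=60)
print(f"{rate:,.0f} Zookeeper writes per second")
```

The rate grows with the product of consumers and partitions, so adding either one multiplies the sustained write load on the ensemble.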
Zookeeper on SSD
Monitoring
Kafka Is Broken!
 Everything is Kafka’s fault first
 What is lag?
 Consumer Problems
– Application problems
– Kafka client problems
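One common way to answer "what is lag?" is the distance between the newest offset in a partition and the consumer's committed offset. A minimal sketch with made-up offsets; `partition_lag` is a hypothetical helper, not a Kafka API.

```python
def partition_lag(log_end_offsets: dict[int, int],
                  committed_offsets: dict[int, int]) -> dict[int, int]:
    """Lag per partition: messages written but not yet committed as consumed.

    A partition with no committed offset is treated as committed at 0.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Made-up offsets for a topic with three partitions.
log_end = {0: 1_500, 1: 1_480, 2: 1_510}
committed = {0: 1_500, 1: 1_200, 2: 1_505}

lag = partition_lag(log_end, committed)
total_lag = sum(lag.values())
print(lag)        # per-partition lag
print(total_lag)  # total messages the consumer is behind
```

In this example nearly all of the lag sits in one partition, which points at the consumer handling that partition rather than at the brokers.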
How Do We Sleep At Night?
 Educating Users
– Why lag is their fault
 Monitoring the Ecosystem
– Kafka Brokers
– Zookeeper
– Mirror Makers
– Audit
– REST Interfaces
 Week Over Week
Cluster Health and Utilization
 Under replicated partitions
 Offline partitions
 Broker partition count
 Data size on disk
 Leader partition count
 Network utilization
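A sketch of how a few of these signals might be turned into alerts. The `BrokerStats` fields, the skew threshold, and the sample numbers are assumptions for illustration; in practice the values would come from the brokers' own metrics.

```python
from dataclasses import dataclass


@dataclass
class BrokerStats:
    name: str
    under_replicated: int  # partitions missing in-sync replicas
    offline: int           # partitions with no leader
    leader_count: int      # partitions this broker leads


def health_problems(brokers: list[BrokerStats], max_leader_skew: float = 0.2) -> list[str]:
    """Return human-readable problems; an empty list means nothing to report."""
    problems = []
    mean_leaders = sum(b.leader_count for b in brokers) / len(brokers)
    for b in brokers:
        if b.offline > 0:
            problems.append(f"{b.name}: {b.offline} offline partitions")
        if b.under_replicated > 0:
            problems.append(f"{b.name}: {b.under_replicated} under-replicated partitions")
        # Flag brokers leading far more or fewer partitions than average.
        if mean_leaders and abs(b.leader_count - mean_leaders) / mean_leaders > max_leader_skew:
            problems.append(f"{b.name}: leader count {b.leader_count} is unbalanced")
    return problems


sample_brokers = [
    BrokerStats("broker1", under_replicated=0, offline=0, leader_count=90),
    BrokerStats("broker2", under_replicated=4, offline=0, leader_count=90),
    BrokerStats("broker3", under_replicated=0, offline=0, leader_count=60),
]
problems = health_problems(sample_brokers)
for problem in problems:
    print(problem)
```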
Zookeeper
 Ensemble availability
 Latency
 Outstanding requests
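Latency and outstanding requests are both reported by ZooKeeper's `mntr` four-letter command, in versions that support it. The sketch below parses that output; the sample text and the alert thresholds are invented for illustration and are not values from the talk.

```python
def parse_mntr(text: str) -> dict[str, str]:
    """Parse `mntr` output: one tab-separated key/value pair per line."""
    parsed = {}
    for line in text.splitlines():
        key, _, value = line.partition("\t")
        if key:
            parsed[key] = value
    return parsed


# Invented sample in the shape `mntr` prints.
sample = (
    "zk_avg_latency\t2\n"
    "zk_max_latency\t41\n"
    "zk_outstanding_requests\t0\n"
    "zk_server_state\tfollower\n"
)

zk_stats = parse_mntr(sample)
# Thresholds are assumptions, not recommendations from the talk.
latency_ok = int(zk_stats["zk_avg_latency"]) < 10
queue_ok = int(zk_stats["zk_outstanding_requests"]) < 10
print(zk_stats["zk_server_state"], latency_ok, queue_ok)
```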
Mirror Maker and Audit
 Mirror Maker
– Lag
– Dropped Messages
 Audit Consumer
– Lag
– Completeness check
 Audit UI
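The completeness check can be sketched as comparing, per time bucket, how many messages producers report sending against how many the audit consumer counted. The bucket labels, the counts, and the `completeness` helper are hypothetical.

```python
def completeness(produced: dict[str, int], consumed: dict[str, int]) -> dict[str, float]:
    """Fraction of produced messages seen downstream, per time bucket."""
    return {
        bucket: consumed.get(bucket, 0) / count
        for bucket, count in produced.items()
        if count > 0
    }


# Made-up ten-minute buckets for one topic.
produced = {"12:00": 1_000, "12:10": 1_200, "12:20": 900}
consumed = {"12:00": 1_000, "12:10": 1_140, "12:20": 900}

ratios = completeness(produced, consumed)
incomplete = [bucket for bucket, ratio in ratios.items() if ratio < 1.0]
print(ratios)
print(incomplete)
```

Running the same comparison against the counts seen after Mirror Maker would show whether messages were lost in the source cluster or during mirroring.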
[Diagram labels: Producer (messages, message counts), Cluster, MM, Cluster, an Audit Consumer (all messages) and Audit State for each cluster, Audit UI]
Audit UI
Tuning
Hardware and OS
 Kernel Tuning
– Swapping is Death
– Allow more dirty pages
– Allow less dirty cache
 Disk throughput
– More spindles
– Longer commit interval
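On Linux, the three kernel bullets are usually expressed through the `vm.swappiness`, `vm.dirty_ratio`, and `vm.dirty_background_ratio` sysctls. That mapping and the target values below are assumptions for illustration, not figures from the slides. The sketch compares current values against targets; the reader function is passed in so the example runs without touching `/proc`.

```python
from typing import Callable

# Illustrative targets only; tune for your own hardware and workload.
TARGETS = {
    "vm.swappiness": 1,              # avoid swapping almost entirely
    "vm.dirty_ratio": 60,            # allow more dirty pages before writers block
    "vm.dirty_background_ratio": 5,  # start background flushing earlier
}


def read_sysctl(name: str) -> int:
    """Read a sysctl value from /proc/sys (Linux only)."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as handle:
        return int(handle.read().strip())


def mismatches(read: Callable[[str], int] = read_sysctl) -> dict[str, tuple[int, int]]:
    """Return {name: (current, target)} for settings that differ from TARGETS."""
    return {
        name: (read(name), target)
        for name, target in TARGETS.items()
        if read(name) != target
    }


# Demo with common distribution defaults instead of the live system.
defaults = {"vm.swappiness": 60, "vm.dirty_ratio": 20, "vm.dirty_background_ratio": 10}
print(mismatches(defaults.__getitem__))
```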
Java Virtual Machine
Garbage Collection
 Java 7, update 51
 Garbage First (G1) Collector
– Set the heap size
– Specify a target GC pause time
– Don’t set the New size
 GC Times
– Less than 15ms per second in GC
– Steady 20-22ms GC intervals
– Almost no full GC cycles (and only 200-400ms when they do occur)
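The three G1 bullets translate into a short set of JVM options. A sketch that builds them; the 4 GB heap and 20 ms pause target are placeholder values rather than the settings used in the talk, and setting the minimum and maximum heap to the same size is an assumption.

```python
def g1_jvm_flags(heap_gb: int, pause_target_ms: int) -> list[str]:
    """Build JVM options for G1: fixed heap size, pause target, no new-size flags."""
    return [
        f"-Xms{heap_gb}g",                          # set the heap size...
        f"-Xmx{heap_gb}g",                          # ...with min and max equal (assumption)
        "-XX:+UseG1GC",                             # Garbage First collector
        f"-XX:MaxGCPauseMillis={pause_target_ms}",  # target GC pause time
        # Deliberately no -Xmn / -XX:NewSize: G1 sizes the young generation itself.
    ]


flags = g1_jvm_flags(heap_gb=4, pause_target_ms=20)
print(" ".join(flags))
```

For scale, 15 ms of GC per second is 1.5% of wall-clock time.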
Closing
What’s Coming in 0.8.2
 Consumer offsets in the broker
 Delete topic
 Further down the road
– New producer
– Improved producer API
Upcoming Operational Work
 Learning to share
 Shrinking a cluster
 Cluster comparison
 Advanced monitoring
How Can You Get Involved?
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Contribute tools
Talk To Us
 Kafka SREs at LinkedIn
– Clark Haskins
 https://www.linkedin.com/in/clarkhaskins
 chaskins@linkedin.com
– Todd Palino
 https://www.linkedin.com/in/toddpalino
 tpalino@linkedin.com
Questions
Enterprise Kafka: Kafka as a Service

The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 

Enterprise Kafka: Kafka as a Service

  • 1. SITE RELIABILITY ENGINEERING©2014 LinkedIn Corporation. All Rights Reserved. Enterprise Kafka
  • 2. Why Am I Here?
      • You want to find out what this “Kafka” thing is
      • You’re running Kafka, but you want to go big
      • You’re looking for some neat whizbangs
  • 3. Clark Haskins, Todd Palino
  • 4. Who Are We?
      • Kafka SRE at LinkedIn
      • Site Reliability Engineering
          – Administrators
          – Architects
          – Developers
      • Keep the site running, always
  • 5. Kafka Overview
  • 6. What Is Kafka?
  • 7. What Is Kafka? [Diagram labels: Broker (A P0, A P1, A P0), Consumer, Producer, Zookeeper]
  • 8. Attributes of a Kafka Cluster
      • Disk Based
      • Durable
      • Scalable
      • Low Latency
      • Finite Retention
      • NOT Idempotent (yet)
  • 9. Kafka At LinkedIn
      • Multiple Datacenters, Multiple Clusters
      • Mirroring between clusters
      • Message Types
          – Metrics
          – Tracking
          – Queuing
      • Data transport from applications to Hadoop, and back
  • 10. Kafka At LinkedIn
  • 11. Kafka At LinkedIn
      • 300+ Kafka brokers
      • Over 18,000 topics
      • 140,000+ Partitions
      • 220 Billion messages per day
      • 40 Terabytes In
      • 160 Terabytes Out
      • Peak Load
          – 3.25 Million messages per second
          – 5.5 Gigabits/sec Inbound
          – 18 Gigabits/sec Outbound
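To put the daily totals on slide 11 next to the peak figures, the averages can be worked out with a few lines of arithmetic. This is a rough sketch that assumes the terabyte figures are per day and use decimal units (1 TB = 10^12 bytes); the slide states neither.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
messages_per_day = 220e9
bytes_in_per_day = 40e12     # assumption: decimal terabytes, per day
bytes_out_per_day = 160e12   # assumption: decimal terabytes, per day
peak_messages_per_sec = 3.25e6

# Daily totals converted to per-second averages.
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
avg_gbit_in = bytes_in_per_day * 8 / SECONDS_PER_DAY / 1e9
avg_gbit_out = bytes_out_per_day * 8 / SECONDS_PER_DAY / 1e9

# Derived ratios.
avg_message_bytes = bytes_in_per_day / messages_per_day
fan_out = bytes_out_per_day / bytes_in_per_day
peak_to_avg = peak_messages_per_sec / avg_messages_per_sec

print(f"average messages/sec: {avg_messages_per_sec:,.0f}")
print(f"average inbound:      {avg_gbit_in:.2f} Gbit/s")
print(f"average outbound:     {avg_gbit_out:.2f} Gbit/s")
print(f"average message size: {avg_message_bytes:.0f} bytes")
print(f"read fan-out:         {fan_out:.1f}x")
print(f"peak / average rate:  {peak_to_avg:.2f}")
```

Under those assumptions the averages come out at roughly 2.5 million messages per second, 3.7 Gbit/s in and 14.8 Gbit/s out, which sits below the quoted peaks, and each byte written is read about four times.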
  • 12. Challenges We Have Overcome
  • 13. Solutions
      • Kafka is young… we influenced development
      • Operations wizardry…
  • 14. Hyper Growth
      • Need to expand clusters to keep up with site traffic, and then balance them.
  • 15. Adding brokers [Diagram labels: Producers, Brokers, Consumers; partitions A P0 to P7, B P0 to P7, and C P0 to P3 spread across the brokers]
  • 16. Adding a broker (with broker leveling) [Diagram labels: Producers, Brokers, Consumers; partitions A P0 to P7, B P0 to P7, and C P0 to P3 spread across the brokers]
  • 17. Logs vs. Metrics
      • Logging data killed the metrics cluster
  • 18. Quality of Service with Kafka [Diagram labels: Producers, Brokers, Consumers; partitions A P0 to P7, B P0 to P7, and C P0 to P3 spread across the brokers]
  • 19. Deployment Nightmares
      • Parallel deployment wasn’t possible so…
      • Babysitting sequential deployments
  • 20. Easy deployments
      • Kafka 0.8.1 makes sure the cluster is in a good state before shutting down
          – If any brokers in the cluster have under-replicated partitions, Kafka will not shut down
          – Kafka ensures that only one broker is in its shutdown sequence at a time.
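The two shutdown rules on slide 20 can be modelled as a small gate function. This is an illustrative sketch only, not Kafka's implementation; the function name and the shape of its inputs are hypothetical.

```python
def may_shut_down(broker_id, under_replicated, shutting_down):
    """Decide whether broker_id may begin its shutdown sequence.

    under_replicated: dict mapping broker id -> number of under-replicated
                      partitions that broker currently reports (hypothetical input)
    shutting_down:    set of broker ids already in their shutdown sequence
    """
    # Rule 1: no broker in the cluster may have under-replicated partitions.
    if any(count > 0 for count in under_replicated.values()):
        return False
    # Rule 2: only one broker may be in its shutdown sequence at a time.
    if shutting_down - {broker_id}:
        return False
    return True


healthy = {1: 0, 2: 0, 3: 0}
degraded = {1: 0, 2: 4, 3: 0}

print(may_shut_down(1, healthy, set()))    # cluster healthy, nobody else stopping
print(may_shut_down(1, degraded, set()))   # broker 2 still has under-replicated partitions
print(may_shut_down(1, healthy, {3}))      # broker 3 is already shutting down
```

With a gate like this, a deployment tool can restart brokers one after another without a person watching each step, which is the point the slide makes about no longer babysitting sequential deployments.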
  • 21. Killing Zookeeper
      • Consumer offset management done within Zookeeper
      • Every consumer committing offsets every minute for every partition makes ZK very unhappy.
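A back-of-the-envelope estimate shows why per-partition offset commits add up. The sketch below assumes one Zookeeper write per consumer group per partition per commit interval; the number of consumer groups is a hypothetical value, and only the partition count comes from the slides.

```python
def offset_commits_per_second(consumer_groups, partitions_per_group, commit_interval_s=60):
    """Estimate steady-state offset writes per second hitting Zookeeper."""
    return consumer_groups * partitions_per_group / commit_interval_s


partitions = 140_000   # partition count quoted on slide 11
groups = 3             # hypothetical: three groups each consuming every partition

rate = offset_commits_per_second(groups, partitions)
print(f"{rate:,.0f} Zookeeper writes per second")
```

Each of those commits is a write, so the load grows with both the partition count and the number of consuming applications.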
  • 22. Zookeeper on SSD
  • 23. Monitoring
  • 24. Kafka Is Broken!
  • 25. Kafka Is Broken!
      • Everything is Kafka’s fault first
      • What is lag?
      • Consumer Problems
          – Application problems
          – Kafka client problems
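Slide 25 leaves "What is lag?" as a question. A common working definition is the gap between the newest offset in a partition and the offset the consumer has committed. The sketch below uses that definition with made-up offsets; the function names are hypothetical.

```python
def partition_lag(log_end_offset, committed_offset):
    """Messages written to a partition that the consumer has not yet committed."""
    return max(log_end_offset - committed_offset, 0)


def total_lag(log_end_offsets, committed_offsets):
    """Sum the lag over all partitions; both dicts map partition id -> offset."""
    return sum(
        partition_lag(end, committed_offsets[partition])
        for partition, end in log_end_offsets.items()
    )


# Made-up offsets for a three-partition topic.
log_end = {0: 1_000, 1: 2_500, 2: 400}
committed = {0: 1_000, 1: 2_100, 2: 250}

print(total_lag(log_end, committed))
```

A lag that keeps growing while the brokers are healthy usually points at the consuming application or its client, which matches the slide's split into application problems and Kafka client problems.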
  • 26. How Do We Sleep At Night?
      • Educating Users
          – Why lag is their fault
      • Monitoring the Ecosystem
          – Kafka Brokers
          – Zookeeper
          – Mirror Makers
          – Audit
          – REST Interfaces
      • Week Over Week
  • 27. Cluster Health and Utilization
      • Under-replicated partitions
      • Offline partitions
      • Broker partition count
      • Data size on disk
      • Leader partition count
      • Network utilization
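Several of the health signals on slide 27 can be derived from partition metadata alone. The sketch below assumes a hypothetical list of partition records (replica list, in-sync replica list, and leader); it treats a partition as under-replicated when its in-sync set is smaller than its replica set and as offline when it has no leader.

```python
from collections import Counter


def cluster_health(partitions):
    """Summarise partition-level health from a list of metadata dicts.

    Each dict is assumed to have the keys 'replicas' (list of broker ids),
    'isr' (list of in-sync broker ids) and 'leader' (broker id or None).
    """
    under_replicated = sum(1 for p in partitions if len(p["isr"]) < len(p["replicas"]))
    offline = sum(1 for p in partitions if p["leader"] is None)
    leader_count = Counter(p["leader"] for p in partitions if p["leader"] is not None)
    partition_count = Counter(b for p in partitions for b in p["replicas"])
    return {
        "under_replicated": under_replicated,
        "offline": offline,
        "leader_count": dict(leader_count),
        "partition_count": dict(partition_count),
    }


# Hypothetical metadata for three partitions on brokers 1 to 3.
sample = [
    {"topic": "A", "partition": 0, "replicas": [1, 2], "isr": [1, 2], "leader": 1},
    {"topic": "A", "partition": 1, "replicas": [2, 3], "isr": [2], "leader": 2},
    {"topic": "B", "partition": 0, "replicas": [1, 3], "isr": [], "leader": None},
]

print(cluster_health(sample))
```

Note that in this model an offline partition is also counted as under-replicated, since its in-sync set is empty. Data size on disk and network utilization need host-level metrics and are not covered here.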
  • 28. Zookeeper
      • Ensemble availability
      • Latency
      • Outstanding requests
  • 29. Mirror Maker and Audit
      • Mirror Maker
          – Lag
          – Dropped Messages
      • Audit Consumer
          – Lag
          – Completeness check
      • Audit UI
      [Diagram labels: Producer, Cluster, MM, Cluster, Messages, Message Counts, Audit Consumer, All Messages, Audit State, Audit UI]
  • 30. Audit UI
  • 31. Audit UI
  • 32. Tuning
  • 33. Hardware and OS
      • Kernel Tuning
          – Swapping is Death
          – Allow more dirty pages
          – Allow less dirty cache
      • Disk throughput
          – More spindles
          – Longer commit interval
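The kernel tuning bullets on slide 33 give directions but no numbers. One way to read them is in terms of the Linux `vm.swappiness`, `vm.dirty_ratio`, and `vm.dirty_background_ratio` settings; that mapping and the target values below are assumptions for illustration, not figures from the talk. The sketch checks sysctl-style text against such targets.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines (as in sysctl.conf) into a dict of ints."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


# Illustrative targets only; the slide does not state any values.
EXAMPLE_TARGETS = {
    "vm.swappiness": ("max", 1),              # "Swapping is Death"
    "vm.dirty_ratio": ("min", 60),            # assumed reading of "more dirty pages"
    "vm.dirty_background_ratio": ("max", 5),  # assumed reading of "less dirty cache"
}


def check(settings, targets=EXAMPLE_TARGETS):
    """Return a list of human-readable problems; empty when all targets are met."""
    problems = []
    for key, (kind, bound) in targets.items():
        value = settings.get(key)
        if value is None:
            problems.append(f"{key} is not set")
        elif kind == "max" and value > bound:
            problems.append(f"{key}={value} is above {bound}")
        elif kind == "min" and value < bound:
            problems.append(f"{key}={value} is below {bound}")
    return problems


current = parse_sysctl("""
# hypothetical host settings
vm.swappiness = 60
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5
""")

for problem in check(current):
    print(problem)
```

The right values depend on memory size, disk layout, and workload, so any targets should be validated on the actual hosts before being rolled out.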
  • 34. Java Virtual Machine
  • 35. Garbage Collection
  • 36. Garbage Collection
      • Java 7, update 51
      • Garbage First (G1) Collector
          – Set the heap size
          – Specify a target GC pause time
          – Don’t set the New size
      • GC Times
          – Less than 15ms per second in GC
          – Steady 20-22ms GC intervals
          – Almost no full GC cycles (and only 200-400ms when they do occur)
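The three G1 bullets on slide 36 translate into a short list of JVM options. The sketch below builds that list; the heap size and pause target passed in are hypothetical values, since the slide does not give the ones used. A second helper converts the "15ms per second" figure into a fraction of wall-clock time.

```python
def g1_jvm_options(heap_gb, pause_target_ms):
    """Build JVM flags for G1 with a fixed heap and a pause-time goal.

    Deliberately omits -Xmn and -XX:NewSize, following the slide's advice
    not to set the New size when using G1.
    """
    return [
        f"-Xms{heap_gb}g",                          # initial heap size
        f"-Xmx{heap_gb}g",                          # maximum heap size
        "-XX:+UseG1GC",                             # select the Garbage First collector
        f"-XX:MaxGCPauseMillis={pause_target_ms}",  # target GC pause time
    ]


def gc_overhead(gc_ms_per_second):
    """Fraction of wall-clock time spent in GC."""
    return gc_ms_per_second / 1000.0


# Hypothetical sizing: 4 GB heap, 20 ms pause target.
print(" ".join(g1_jvm_options(4, 20)))
print(f"GC overhead at 15 ms/s: {gc_overhead(15):.1%}")
```

Leaving the young generation unsized lets G1 resize it as needed to meet the pause goal, and 15ms of GC per second corresponds to an overhead of 1.5%.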
  • 37. Closing
  • 38. What’s Coming in 0.8.2
      • Consumer offsets in the broker
      • Delete topic
      • Further down the road
          – New producer
          – Improved producer API
  • 39. Upcoming Operational Work
      • Learning to share
      • Shrinking a cluster
      • Cluster comparison
      • Advanced monitoring
  • 40. How Can You Get Involved?
      • http://kafka.apache.org
      • Join the mailing lists
          – users@kafka.apache.org
      • irc.freenode.net - #apache-kafka
      • Contribute tools
  • 41. Talk To Us
      • Kafka SREs at LinkedIn
          – Clark Haskins
              • https://www.linkedin.com/in/clarkhaskins
              • chaskins@linkedin.com
          – Todd Palino
              • https://www.linkedin.com/in/toddpalino
              • tpalino@linkedin.com
  • 42. Questions