SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka at Peak Performance
Todd Palino
Who Am I?
Kafka At LinkedIn
 1100+ Kafka brokers
 Over 32,000 topics
 350,000+ Partitions
 875 Billion messages per day
 185 Terabytes In
 675 Terabytes Out
 Peak Load (whole site)
– 10.5 Million messages/sec
– 18.5 Gigabits/sec Inbound
– 70.5 Gigabits/sec Outbound
 1800+ Kafka brokers
 Over 79,000 topics
 1,130,000+ Partitions
 1.3 Trillion messages per day
 330 Terabytes In
 1.2 Petabytes Out
 Peak Load (single cluster)
– 2 Million messages/sec
– 4.7 Gigabits/sec Inbound
– 15 Gigabits/sec Outbound
What Will We Talk About?
 Picking Your Hardware
 Monitoring the Cluster
 Triaging Broker Performance Problems
 Conclusion
Hardware Selection
What’s Important To You?
 Message Retention - Disk size
 Message Throughput - Network capacity
 Producer Performance - Disk I/O
 Consumer Performance - Memory
Go Wide
 Kafka is well-suited to horizontal scaling
 RAIS - Redundant Array of Inexpensive Servers
 Also helps with CPU utilization
– Kafka needs to decompress and recompress every message batch
– KIP-31 will help with this by eliminating recompression
 Don’t co-locate Kafka
Disk Layout
 RAID
– Can survive a single disk failure (not RAID 0)
– Provides the broker with a single log directory
– Eats up disk I/O
 JBOD
– Gives Kafka all the disk I/O available
– Broker is not smart about balancing partitions
– If one disk fails, the entire broker stops
 Amazon EBS performance works!
Operating System Tuning
 Filesystem Options
– EXT or XFS
– Using unsafe mount options
 Virtual Memory
– Swappiness
– Dirty Pages
 Networking
Java
 Only use JDK 8 now
 Keep heap size small
– Even our largest brokers use a 6 GB heap
– Save the rest for page cache
 Garbage Collection - G1 all the way
– Basic tuning only
– Watch for humongous allocations
How Much Do You Need?
Buy The Book!
Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
Kafka Cluster Sizing
 How big for your local cluster?
– How much disk space do you have?
– How much network bandwidth do you have?
– CPU, memory, disk I/O
 How big for your aggregate cluster?
– In general, multiply the number of brokers by the number of local clusters
– May have additional concerns with lots of consumers
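The sizing questions above can be turned into a rough estimate. The sketch below is only an illustration: the function names, the 60% utilization ceiling, and the way replication traffic is counted are assumptions, not figures from the talk. It sizes a local cluster by disk and by network, then applies the slide's rule of thumb for the aggregate cluster.

```python
import math


def local_cluster_brokers(daily_in_tb, retention_days, replication_factor,
                          usable_disk_tb_per_broker,
                          peak_in_gbps, peak_out_gbps, nic_gbps,
                          max_utilization=0.6):
    """Return (brokers_by_disk, brokers_by_network) for one local cluster.

    All parameters are hypothetical inputs; max_utilization is an assumed
    safety margin, not a LinkedIn number.
    """
    # Disk: each retained byte is stored once per replica.
    stored_tb = daily_in_tb * retention_days * replication_factor
    by_disk = math.ceil(stored_tb / (usable_disk_tb_per_broker * max_utilization))

    # Network: followers fetch every produced byte from the leaders, so
    # replication adds (replication_factor - 1) copies in both directions.
    replication_gbps = peak_in_gbps * (replication_factor - 1)
    cluster_in = peak_in_gbps + replication_gbps
    cluster_out = peak_out_gbps + replication_gbps
    by_network = math.ceil(max(cluster_in, cluster_out) / (nic_gbps * max_utilization))
    return by_disk, by_network


def aggregate_cluster_brokers(local_brokers, local_clusters):
    # Rule of thumb from the slide: local broker count times local clusters.
    return local_brokers * local_clusters


by_disk, by_network = local_cluster_brokers(
    daily_in_tb=10, retention_days=4, replication_factor=2,
    usable_disk_tb_per_broker=10,
    peak_in_gbps=2, peak_out_gbps=6, nic_gbps=10, max_utilization=0.5)
local = max(by_disk, by_network)
print(local, aggregate_cluster_brokers(local, local_clusters=3))
```

Whichever resource gives the larger broker count is the constraint. CPU, memory, and disk I/O still need checking separately, since this sketch does not model them.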
Topic Configuration
 Partition Counts for Local
– Many theories on how to do this correctly, but the answer is “it depends”
– How many consumers do you have?
– Do you have specific partition requirements?
– Keeping partition sizes manageable
 Partition Counts for Aggregate
– Multiply the number of partitions in a local cluster by the number of local clusters
– Periodically review partition counts in all clusters
 Message Retention
– If aggregate is where you really need the messages, retain them in local only long enough to cover mirror maker problems
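One way to put numbers on "it depends" is sketched below. The function names and the inputs (retained size, a per-partition size limit, consumer count) are hypothetical. Using the broker count as a floor is borrowed from the deck's own ground rule for default partition counts, and the aggregate rule is the one stated on this slide.

```python
import math


def local_partition_count(retained_gb, max_partition_gb, consumer_instances, brokers):
    # Keep each partition under a manageable size (the limit is an assumption).
    by_size = math.ceil(retained_gb / max_partition_gb)
    # Have at least one partition per consumer instance and per broker.
    return max(by_size, consumer_instances, brokers)


def aggregate_partition_count(local_partitions, local_clusters):
    # Slide rule: local partition count times the number of local clusters.
    return local_partitions * local_clusters


local = local_partition_count(retained_gb=500, max_partition_gb=50,
                              consumer_instances=8, brokers=12)
print(local, aggregate_partition_count(local, local_clusters=4))
```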
Possible Broker Improvements
 Namespaces
– Namespace topics by datacenter
– Eliminate local clusters and just have aggregate
– Significant hardware savings
 JBOD Fixes
– Intelligent partition assignment
– Admin tools to move partitions between mount points
– Broker should not fail completely with a single disk failure
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
Keeping An Eye On Things
Monitoring The Foundation
 CPU Load
 Network inbound and outbound
 Filehandle usage for Kafka
 Disk
– Free space - where you write logs, and where Kafka stores messages
– Free inodes
– I/O performance - at least average wait and percent utilization
 Garbage Collection
Broker Ground Rules
 Tuning
– Stick (mostly) with the defaults
– Set default cluster retention as appropriate
– Default partition count should be at least the number of brokers
 Monitoring
– Watch the right things
– Don’t try to alert on everything
 Triage and Resolution
– Solve problems, don’t mask them
Too Much Information!
 Monitoring teams hate Kafka
– Per-Topic metrics
– Per-Partition metrics
– Per-Client metrics
 Capture as much as you can
– Many metrics are useful while triaging an issue
 Clients want metrics on their own topics
 Only alert on what is needed to signal a problem
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
Topic Monitoring
 Bytes In, Bytes Out
 Messages In, Produce Rate, Produce Failure Rate
 Fetch Rate, Fetch Failure Rate
 Partition Bytes
 Log End Offset
– Why bother?
– KIP-32 will make this unnecessary
 Quota Throttling
 Provide this to your customers for them to alert on
Client Monitoring
 For consumers, use Burrow
– Monitor all partitions for all consumers
– Provides an easy-to-digest “good, warning, bad” state, with detail available
– Fast and free
 Producers are a little harder
– Several internal implementations of message auditing
– The community needs a good open source standard
 Cluster availability monitoring
– kafka-monitoring is coming soon from LinkedIn!
It’s Broken! Now What?
All The Best Ops People…
 Know more of what is happening than their customers
 Are proactive
 Fix bugs instead of working around them
 This applies to our developers too!
Anticipating Trouble
 Trend cluster utilization and growth over time
 Use default configurations for quotas and retention to require customers to talk to you
 Monitor request times
– If you are able to develop a consistent baseline, this is early warning
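A minimal sketch of the baseline idea: compare the latest request time against the mean and spread of recent samples. The function name, the three-sigma threshold, and the sample values are assumptions for illustration; a real baseline would likely account for time of day and request type.

```python
import statistics


def exceeds_baseline(baseline_ms, current_ms, sigmas=3.0):
    """Flag a request time well above the recent baseline.

    baseline_ms needs at least two samples; sigmas is an assumed threshold.
    """
    mean = statistics.mean(baseline_ms)
    spread = statistics.stdev(baseline_ms)
    return current_ms > mean + sigmas * spread


recent_total_time_ms = [10, 11, 9, 10, 10]  # hypothetical samples
print(exceeds_baseline(recent_total_time_ms, 30))
print(exceeds_baseline(recent_total_time_ms, 10.5))
```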
Under Replicated Partitions
 Count of partitions that are not fully replicated within the cluster
 Also referred to as “replica lag”
 Primary indicator of problems within the cluster
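Each broker reports this count for the partitions it leads, so it helps to look at the values per broker as well as the total. The sketch below assumes the counts have already been collected into a dictionary keyed by broker name; the function and the broker names are hypothetical.

```python
def under_replicated_summary(urp_by_broker):
    """Return (total count, sorted list of brokers reporting a nonzero count)."""
    affected = {broker: count for broker, count in urp_by_broker.items() if count > 0}
    return sum(affected.values()), sorted(affected)


total, brokers = under_replicated_summary({"broker1": 0, "broker2": 12, "broker3": 3})
print(total, brokers)
```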
Broker Performance Checks
 Are you still running 0.8?
 Are all the brokers in the cluster working?
 Are the network interfaces saturated?
– Reelect partition leaders
– Rebalance partitions in the cluster
– Spread out traffic more (increase partitions or brokers)
 Is the CPU utilization high? (especially iowait)
– Is another process competing for resources?
– Look for a bad disk
 Do you have really big messages?
Kafka’s OK, Now What?
 If Kafka is working properly, it’s probably a client issue
– Don’t throw it over the fence. Help your customers understand
 Common producer issues
– Batch size and linger time
– Receive and send buffers
– Sync vs. async, and acknowledgements
 Common consumer issues
– Garbage collection problems
– Min fetch bytes and max wait time
– Not enough partitions
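For reference, the client issues above map onto these configuration keys in the Java clients. The key names are the standard ones; the values are placeholders chosen for illustration, not recommendations from the talk, and `to_properties` is a hypothetical helper.

```python
# Producer: batching, socket buffers, and acknowledgements.
producer_settings = {
    "batch.size": 65536,            # bytes per partition batch
    "linger.ms": 5,                 # wait this long to fill a batch
    "receive.buffer.bytes": 65536,  # TCP receive buffer
    "send.buffer.bytes": 131072,    # TCP send buffer
    "acks": "all",                  # 0, 1, or all
}

# Consumer: how much data to wait for, and for how long.
consumer_settings = {
    "fetch.min.bytes": 1,
    "fetch.max.wait.ms": 500,
    "receive.buffer.bytes": 65536,
}


def to_properties(settings):
    """Render a settings dict as sorted .properties lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


print(to_properties(producer_settings))
print(to_properties(consumer_settings))
```

Sync versus async is not a configuration key in the Java producer: it depends on whether the application blocks on the result of each send.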
Conclusion
One Ecosystem
 Kafka can scale to millions of messages per second, and more
– Operations must scale the cluster appropriately
– Developers must use the right tuning and go parallel
 Few problems are owned by only one side
– Expanding partitions often requires coordination
– Applications that need higher reliability drive cluster configurations
 Either we work together, or we fail separately
Would You Like To Know More?
 Presentations: http://www.slideshare.net/toddpalino
– More Datacenters, More Problems
– Kafka As A Service
– Always download the originals for slide notes!
 Blog Posts: https://engineering.linkedin.com/blog
– Development and SRE blogs on Kafka and other topics
 LinkedIn Open Source: https://github.com/linkedin/streaming
– Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
– Kafka Admin Tools - https://github.com/linkedin/kafka-tools
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Apache Kafka - http://www.meetup.com/http-kafka-apache-org
– Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
 Contribute code
Data @ LinkedIn is Hiring!
 Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next Generation change capture technology (incubating)
 LinkedIn
– Strong commitment to open source
– Do cool things and work with awesome people
 Join us in working on cutting edge stream processing infrastructures
– Please contact kparamasivam@linkedin.com
– Software developers and Site Reliability Engineers at all levels
Appendix
JDK Options
Heap Size: -Xmx6g -Xms6g
Metaspace: -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
G1 Tuning: -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
GC Logging: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc
Error Handling: -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log
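The region size matters for the "humongous allocations" warning earlier in the deck: G1 treats an object as humongous when it is at least half a region in size. A small sketch of that arithmetic for the 16M region size above follows; the helper names are made up for illustration.

```python
MIB = 1024 * 1024


def humongous_threshold_bytes(region_size_bytes):
    # G1 treats an object as humongous when it needs at least half a region.
    return region_size_bytes // 2


def is_humongous(object_bytes, region_size_bytes=16 * MIB):
    return object_bytes >= humongous_threshold_bytes(region_size_bytes)


print(humongous_threshold_bytes(16 * MIB) // MIB)  # threshold in MiB
print(is_humongous(1 * MIB), is_humongous(10 * MIB))
```

With 16M regions, a single allocation of 8 MiB or more falls into this category. That is one reason very large messages deserve attention when triaging a broker.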
OS Tuning Parameters
 Networking:
net.core.rmem_default = 124928
net.core.rmem_max = 2048000
net.core.wmem_default = 124928
net.core.wmem_max = 2048000
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 1024
OS Tuning Parameters (cont.)
 Virtual Memory
vm.oom_kill_allocating_task = 1
vm.max_map_count = 200000
vm.swappiness = 1
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5
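To get a feel for what the two dirty ratios mean on a given host, the sketch below converts them to byte thresholds. The kernel applies the percentages to available memory (free plus reclaimable pages), so using a fixed memory figure here is a simplifying assumption, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def dirty_page_thresholds(available_bytes, background_ratio=5, dirty_ratio=60):
    """Return (background flush threshold, blocking threshold) in bytes."""
    # Background flusher threads start writing at dirty_background_ratio.
    background = available_bytes * background_ratio // 100
    # Writing processes are made to flush themselves at dirty_ratio.
    blocking = available_bytes * dirty_ratio // 100
    return background, blocking


def centisecs_to_seconds(centisecs):
    # The *_centisecs settings are expressed in hundredths of a second.
    return centisecs / 100


background, blocking = dirty_page_thresholds(100 * GIB)
print(background // GIB, blocking // GIB, centisecs_to_seconds(500))
```

The wide gap between the two thresholds lets the kernel flush in the background long before producers would be blocked on disk writes.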
Kafka Broker Sensors
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
kafka.server:name=PartitionCount,type=ReplicaManager
kafka.server:name=LeaderCount,type=ReplicaManager
kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
kafka.controller:name=ActiveControllerCount,type=KafkaController
kafka.controller:name=OfflinePartitionsCount,type=KafkaController
kafka.log:name=max-dirty-percent,type=LogCleanerManager
kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
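Monitoring tools print the key properties of these names in different orders (many show `type=` first), which makes string comparison unreliable. The sketch below splits a sensor name into its domain and key properties, then renders it with keys sorted, the same order the list above uses. The function names are hypothetical, and the parser assumes simple unquoted values like the ones listed here.

```python
def parse_sensor(name):
    """Split 'domain:key=value,key=value' into (domain, {key: value})."""
    domain, _, properties = name.partition(":")
    keys = dict(pair.split("=", 1) for pair in properties.split(","))
    return domain, keys


def canonical(name):
    """Render a sensor name with its key properties in sorted order."""
    domain, keys = parse_sensor(name)
    ordered = ",".join(f"{key}={keys[key]}" for key in sorted(keys))
    return f"{domain}:{ordered}"


print(parse_sensor("kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics"))
print(canonical("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"))
```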
Kafka Broker Sensors - Topics
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*

More Related Content

What's hot

Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
confluent
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
confluent
 
kafka
kafkakafka
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
Mohammed Fazuluddin
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Jean-Paul Azar
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
Jiangjie Qin
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
SANG WON PARK
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
Jiangjie Qin
 
Kafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedKafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presented
Sumant Tambe
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
Sumant Tambe
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
Guozhang Wang
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
confluent
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
Jiangjie Qin
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
Todd Palino
 

What's hot (20)

Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
 
kafka
kafkakafka
kafka
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 
Kafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedKafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presented
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 

Viewers also liked

Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
Todd Palino
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
Todd Palino
 
Challenges of a multi tenant kafka service
Challenges of a multi tenant kafka serviceChallenges of a multi tenant kafka service
Challenges of a multi tenant kafka service
Thomas Alex
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafka
Todd Palino
 
Microsoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka serviceMicrosoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka service
Nitin Kumar
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
Todd Palino
 

Viewers also liked (6)

Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
 
Challenges of a multi tenant kafka service
Challenges of a multi tenant kafka serviceChallenges of a multi tenant kafka service
Challenges of a multi tenant kafka service
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafka
 
Microsoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka serviceMicrosoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka service
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 

Similar to Kafka at Peak Performance

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
Nitin Kumar
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
Todd Palino
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise control
Bill Liu
 
ARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities ReportARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities Report
ARIN
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterpriseAnas Kanzoua
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
Filipe Miranda
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
MongoDB
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016
AdobeMarketingCloud
 
MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)
Mario Beck
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
Jim Haughwout
 
The role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial InformaticsThe role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial Informatics
Aerospike, Inc.
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should know
Ratish Ravindran
 
Management and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesManagement and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - Slides
Severalnines
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
Felix GV
 
#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session
Brocade
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
ScyllaDB
 
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Dan Cundiff
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area Networks
AreaNetworking.it
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Group, Inc.
 
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB
 

Similar to Kafka at Peak Performance (20)

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise control
 
ARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities ReportARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities Report
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterprise
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016
 
MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
The role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial InformaticsThe role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial Informatics
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should know
 
Management and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesManagement and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - Slides
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
 
#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area Networks
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
 
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
 

Kafka at Peak Performance

  • 1. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka at Peak Performance
  • 2. Todd Palino
  • 3. Who Am I?
  • 4. Kafka At LinkedIn
      • 1100+ Kafka brokers
      • Over 32,000 topics
      • 350,000+ Partitions
      • 875 Billion messages per day
      • 185 Terabytes In
      • 675 Terabytes Out
      • Peak Load (whole site)
          – 10.5 Million messages/sec
          – 18.5 Gigabits/sec Inbound
          – 70.5 Gigabits/sec Outbound

      • 1800+ Kafka brokers
      • Over 79,000 topics
      • 1,130,000+ Partitions
      • 1.3 Trillion messages per day
      • 330 Terabytes In
      • 1.2 Petabytes Out
      • Peak Load (single cluster)
          – 2 Million messages/sec
          – 4.7 Gigabits/sec Inbound
          – 15 Gigabits/sec Outbound
  • 5. What Will We Talk About?
      • Picking Your Hardware
      • Monitoring the Cluster
      • Triaging Broker Performance Problems
      • Conclusion
  • 6. Hardware Selection
  • 7. What’s Important To You?
      • Message Retention - Disk size
      • Message Throughput - Network capacity
      • Producer Performance - Disk I/O
      • Consumer Performance - Memory
  • 8. Go Wide
      • Kafka is well-suited to horizontal scaling
      • RAIS - Redundant Array of Inexpensive Servers
      • Also helps with CPU utilization
          – Kafka needs to decompress and recompress every message batch
          – KIP-31 will help with this by eliminating recompression
      • Don’t co-locate Kafka
  • 9. Disk Layout
      • RAID
          – Can survive a single disk failure (not RAID 0)
          – Provides the broker with a single log directory
          – Eats up disk I/O
      • JBOD
          – Gives Kafka all the disk I/O available
          – Broker is not smart about balancing partitions
          – If one disk fails, the entire broker stops
      • Amazon EBS performance works!
  • 10. Operating System Tuning
      • Filesystem Options
          – EXT or XFS
          – Using unsafe mount options
      • Virtual Memory
          – Swappiness
          – Dirty Pages
      • Networking
  • 11. Java
      • Only use JDK 8 now
      • Keep heap size small
          – Even our largest brokers use a 6 GB heap
          – Save the rest for page cache
      • Garbage Collection - G1 all the way
          – Basic tuning only
          – Watch for humongous allocations
  • 12. How Much Do You Need?
  • 13. Buy The Book!
      • Early Access available now.
      • Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting.
      • Also discusses stream processing and other use cases.
  • 14. Kafka Cluster Sizing
      • How big for your local cluster?
          – How much disk space do you have?
          – How much network bandwidth do you have?
          – CPU, memory, disk I/O
      • How big for your aggregate cluster?
          – In general, multiply the number of brokers by the number of local clusters
          – May have additional concerns with lots of consumers
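The sizing questions on this slide can be turned into rough arithmetic. The sketch below is only an illustration: the function names, the idea of taking the larger of a disk-bound and a network-bound estimate, and every number in the example are assumptions rather than figures from the talk. The aggregate rule (local brokers times the number of local clusters) is the one the slide states.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def brokers_for_disk(bytes_in_per_day, retention_days, replication_factor,
                     usable_bytes_per_broker):
    """Brokers needed so that retained, replicated data fits on disk."""
    total = bytes_in_per_day * retention_days * replication_factor
    return ceil_div(total, usable_bytes_per_broker)


def brokers_for_network(peak_bits_per_sec, usable_bits_per_sec_per_broker):
    """Brokers needed so that peak traffic fits in the network budget."""
    return ceil_div(peak_bits_per_sec, usable_bits_per_sec_per_broker)


def local_cluster_brokers(disk_estimate, network_estimate):
    """Assumption: size for whichever resource runs out first."""
    return max(disk_estimate, network_estimate)


def aggregate_cluster_brokers(local_brokers, local_clusters):
    """Rule of thumb from the slide: local brokers times local clusters."""
    return local_brokers * local_clusters


TB = 10**12
GBIT = 10**9

# Hypothetical workload: 10 TB/day in, 4 days retention, replication factor 2,
# 8 TB usable disk per broker, 30 Gbit/s peak outbound, 5 Gbit/s usable per broker.
disk = brokers_for_disk(10 * TB, 4, 2, 8 * TB)
net = brokers_for_network(30 * GBIT, 5 * GBIT)
local = local_cluster_brokers(disk, net)
print(disk, net, local, aggregate_cluster_brokers(local, 3))
```

CPU, memory, and disk I/O are left out of the sketch on purpose; the slide lists them as further constraints to check, and they do not reduce to a one-line formula.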
  • 15. Topic Configuration
      • Partition Counts for Local
          – Many theories on how to do this correctly, but the answer is “it depends”
          – How many consumers do you have?
          – Do you have specific partition requirements?
          – Keeping partition sizes manageable
      • Partition Counts for Aggregate
          – Multiply the number of partitions in a local cluster by the number of local clusters
          – Periodically review partition counts in all clusters
      • Message Retention
          – If aggregate is where you really need the messages, only retain it in local for long enough to cover mirror maker problems
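One way to make the partition-count questions concrete is sketched below. The heuristic is an assumption, not the speaker's method: it takes the largest of the consumer count, the count needed to keep each partition under a chosen size, and the broker count. Only the aggregate rule (local partitions times local clusters) comes from the slide. All names and numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def local_partition_count(consumers, retained_bytes, max_partition_bytes, brokers):
    """Assumed heuristic: enough partitions for every consumer, for a
    manageable partition size, and for every broker to hold at least one."""
    by_size = ceil_div(retained_bytes, max_partition_bytes)
    return max(consumers, by_size, brokers)


def aggregate_partition_count(local_partitions, local_clusters):
    """Rule from the slide: local partitions times the number of local clusters."""
    return local_partitions * local_clusters


GB = 10**9

# Hypothetical topic: 12 consumers, 1000 GB retained, 50 GB per partition at most,
# 8 brokers in the local cluster, 4 local clusters feeding the aggregate.
local = local_partition_count(12, 1000 * GB, 50 * GB, 8)
print(local, aggregate_partition_count(local, 4))
```

As the slide says, the real answer is "it depends"; a calculation like this is a starting point for the periodic review, not a replacement for it.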
  • 16. Possible Broker Improvements
      • Namespaces
          – Namespace topics by datacenter
          – Eliminate local clusters and just have aggregate
          – Significant hardware savings
      • JBOD Fixes
          – Intelligent partition assignment
          – Admin tools to move partitions between mount points
          – Broker should not fail completely with a single disk failure
  • 17. Administrative Improvements
      • Multiple cluster management
          – Topic management across clusters
          – Visualization of mirror maker paths
      • Better client monitoring
          – Burrow for consumer monitoring
          – No open source solution for producer monitoring (audit)
      • End-to-end availability monitoring
  • 18. Keeping An Eye On Things
  • 19. Monitoring The Foundation
      • CPU Load
      • Network inbound and outbound
      • Filehandle usage for Kafka
      • Disk
          – Free space - where you write logs, and where Kafka stores messages
          – Free inodes
          – I/O performance - at least average wait and percent utilization
      • Garbage Collection
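The free-space and free-inode checks listed above can be sketched with the standard library's `os.statvfs` (Unix only). The function name and the choice of `/` as the example path are assumptions; on a real broker the paths of interest would be the application log directory and each Kafka log directory.

```python
import os


def disk_headroom(path):
    """Return (free_bytes, free_space_pct, free_inode_pct) for the
    filesystem that holds `path`."""
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize
    total_bytes = st.f_blocks * st.f_frsize
    free_space_pct = 100.0 * free_bytes / total_bytes if total_bytes else 0.0
    # Some filesystems report zero total inodes; treat that as "not applicable".
    free_inode_pct = 100.0 * st.f_favail / st.f_files if st.f_files else None
    return free_bytes, free_space_pct, free_inode_pct


# Example path only; substitute the directories that matter on the broker.
free_bytes, space_pct, inode_pct = disk_headroom("/")
print(f"free bytes: {free_bytes}, free space: {space_pct:.1f}%, free inodes: {inode_pct}")
```

I/O wait and utilization are not covered here; those usually come from tools such as `iostat` rather than from a filesystem call.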
  • 20. Broker Ground Rules
      • Tuning
          – Stick (mostly) with the defaults
          – Set default cluster retention as appropriate
          – Default partition count should be at least the number of brokers
      • Monitoring
          – Watch the right things
          – Don’t try to alert on everything
      • Triage and Resolution
          – Solve problems, don’t mask them
  • 21. Too Much Information!
      • Monitoring teams hate Kafka
          – Per-Topic metrics
          – Per-Partition metrics
          – Per-Client metrics
      • Capture as much as you can
          – Many metrics are useful while triaging an issue
      • Clients want metrics on their own topics
      • Only alert on what is needed to signal a problem
  • 22. Broker Monitoring
      • Bytes In and Out, Messages In
          – Why not messages out?
      • Partitions
          – Count and Leader Count
          – Under Replicated and Offline
      • Threads
          – Network pool, Request pool
          – Max Dirty Percent
      • Requests
          – Rates and times - total, queue, local, and send
  • 23. Topic Monitoring
      • Bytes In, Bytes Out
      • Messages In, Produce Rate, Produce Failure Rate
      • Fetch Rate, Fetch Failure Rate
      • Partition Bytes
      • Log End Offset
          – Why bother?
          – KIP-32 will make this unnecessary
      • Quota Throttling
      • Provide this to your customers for them to alert on
  • 24. Client Monitoring
      • For consumers, use Burrow
          – Monitor all partitions for all consumers
          – Provides an easy to digest “good, warning, bad” state, with detail available
          – Fast and free
      • Producers are a little harder
          – Several internal implementations of message auditing
          – The community needs a good open source standard
      • Cluster availability monitoring
          – kafka-monitoring is coming soon from LinkedIn!
  • 25. It’s Broken! Now What?
  • 26. All The Best Ops People…
      • Know more of what is happening than their customers
      • Are proactive
      • Fix bugs, not work around them
      • This applies to our developers too!
  • 27. Anticipating Trouble
      • Trend cluster utilization and growth over time
      • Use default configurations for quotas and retention to require customers to talk to you
      • Monitor request times
          – If you are able to develop a consistent baseline, this is early warning
  • 28. Under Replicated Partitions
      • Count of number of partitions which are not fully replicated within the cluster
      • Also referred to as “replica lag”
      • Primary indicator of problems within the cluster
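The definition above can be sketched as a small check over partition metadata: a partition counts as under-replicated when its in-sync replica set is smaller than its assigned replica set. The dictionary layout and the sample data are hypothetical; a real check would read this metadata from the cluster or use the broker's own UnderReplicatedPartitions metric.

```python
def under_replicated(partitions):
    """Return the partitions whose in-sync replicas (ISR) do not cover
    every assigned replica."""
    return [p for p in partitions if len(p["isr"]) < len(p["replicas"])]


# Hypothetical metadata for three partitions of one topic.
partitions = [
    {"topic": "events", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "events", "partition": 1, "replicas": [2, 3, 1], "isr": [2, 3]},
    {"topic": "events", "partition": 2, "replicas": [3, 1, 2], "isr": [3]},
]

lagging = under_replicated(partitions)
print(len(lagging), [p["partition"] for p in lagging])
```

A non-zero count that persists is the signal to start the broker checks; a brief spike during a broker restart is usually expected.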
  • 29. Broker Performance Checks
      • Are you still running 0.8?
      • Are all the brokers in the cluster working?
      • Are the network interfaces saturated?
          – Reelect partition leaders
          – Rebalance partitions in the cluster
          – Spread out traffic more (increase partitions or brokers)
      • Is the CPU utilization high? (especially iowait)
          – Is another process competing for resources?
          – Look for a bad disk
      • Do you have really big messages?
  • 30. Kafka’s OK, Now What?
      • If Kafka is working properly, it’s probably a client issue
          – Don’t throw it over the fence. Help your customers understand
      • Common producer issues
          – Batch size and linger time
          – Receive and send buffers
          – Sync vs. async, and acknowledgements
      • Common consumer issues
          – Garbage collection problems
          – Min fetch bytes and max wait time
          – Not enough partitions
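The producer and consumer knobs named on this slide correspond to configuration keys in the Java clients (`batch.size`, `linger.ms`, `send.buffer.bytes`, `receive.buffer.bytes`, `acks`, `fetch.min.bytes`, `fetch.max.wait.ms`). The sketch below writes them out as properties text so they can be reviewed with a customer. The values are placeholders chosen for illustration, not recommendations from the talk, and `to_properties` is a hypothetical helper.

```python
# Placeholder values only; tune against the actual workload.
PRODUCER_SETTINGS = {
    "batch.size": 65536,             # bytes collected per partition before sending
    "linger.ms": 10,                 # how long to wait for a batch to fill
    "send.buffer.bytes": 1048576,    # TCP send buffer
    "receive.buffer.bytes": 1048576, # TCP receive buffer
    "acks": "all",                   # acknowledgement level
}

CONSUMER_SETTINGS = {
    "fetch.min.bytes": 65536,        # broker waits until this much data is ready
    "fetch.max.wait.ms": 500,        # ...or until this much time has passed
}


def to_properties(settings):
    """Render a settings dict as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


print(to_properties(PRODUCER_SETTINGS))
print(to_properties(CONSUMER_SETTINGS))
```

Batch size and linger time trade latency for throughput on the producer side; min fetch bytes and max wait time make the same trade on the consumer side.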
  • 31. Conclusion
  • 32. One Ecosystem
      • Kafka can scale to millions of messages per second, and more
          – Operations must scale the cluster appropriately
          – Developers must use the right tuning and go parallel
      • Few problems are owned by only one side
          – Expanding partitions often requires coordination
          – Applications that need higher reliability drive cluster configurations
      • Either we work together, or we fail separately
  • 33. Would You Like To Know More?
      • Presentations: http://www.slideshare.net/toddpalino
          – More Datacenters, More Problems
          – Kafka As A Service
          – Always download the originals for slide notes!
      • Blog Posts: https://engineering.linkedin.com/blog
          – Development and SRE blogs on Kafka and other topics
      • LinkedIn Open Source: https://github.com/linkedin/streaming
          – Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
          – Kafka Admin Tools - https://github.com/linkedin/kafka-tools
  • 34. Getting Involved With Kafka
      • http://kafka.apache.org
      • Join the mailing lists
          – users@kafka.apache.org
          – dev@kafka.apache.org
      • irc.freenode.net - #apache-kafka
      • Meetups
          – Apache Kafka - http://www.meetup.com/http-kafka-apache-org
          – Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
      • Contribute code
  • 35. Data @ LinkedIn is Hiring!
      • Streams Infrastructure
          – Kafka pub/sub ecosystem
          – Stream Processing Platform built on Apache Samza
          – Next Generation change capture technology (incubating)
      • LinkedIn
          – Strong commitment to open source
          – Do cool things and work with awesome people
      • Join us in working on cutting edge stream processing infrastructures
          – Please contact kparamasivam@linkedin.com
          – Software developers and Site Reliability Engineers at all levels
  • 36.
  • 37. Appendix
  • 38. JDK Options
      • Heap Size
          -Xmx6g -Xms6g
      • Metaspace
          -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
      • G1 Tuning
          -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
      • GC Logging
          -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc
      • Error Handling
          -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log
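A small sketch of how these option groups might be assembled into environment variables for the broker start script. The variable names `KAFKA_HEAP_OPTS` and `KAFKA_JVM_PERFORMANCE_OPTS` are the ones read by the stock Kafka shell scripts as far as I know, but treat that mapping, the grouping, and the helper names as assumptions to verify against your distribution. The de-duplication step is a generic safeguard in case a flag is listed twice.

```python
# Assumed mapping of slide option groups to start-script environment variables.
JVM_OPTION_GROUPS = {
    "KAFKA_HEAP_OPTS": ["-Xmx6g", "-Xms6g"],
    "KAFKA_JVM_PERFORMANCE_OPTS": [
        "-XX:MetaspaceSize=96m",
        "-XX:MinMetaspaceFreeRatio=50",
        "-XX:MaxMetaspaceFreeRatio=80",
        "-XX:+UseG1GC",
        "-XX:MaxGCPauseMillis=20",
        "-XX:InitiatingHeapOccupancyPercent=35",
        "-XX:G1HeapRegionSize=16M",
    ],
}


def dedupe(flags):
    """Drop repeated flags while keeping the first occurrence in order."""
    return list(dict.fromkeys(flags))


def export_lines(groups):
    """Render one shell `export` line per option group."""
    lines = []
    for name, flags in groups.items():
        joined = " ".join(dedupe(flags))
        lines.append(f'export {name}="{joined}"')
    return lines


for line in export_lines(JVM_OPTION_GROUPS):
    print(line)
```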
  • 39. OS Tuning Parameters
      • Networking:
          net.core.rmem_default = 124928
          net.core.rmem_max = 2048000
          net.core.wmem_default = 124928
          net.core.wmem_max = 2048000
          net.ipv4.tcp_rmem = 4096 87380 4194304
          net.ipv4.tcp_wmem = 4096 16384 4194304
          net.ipv4.tcp_max_tw_buckets = 262144
          net.ipv4.tcp_max_syn_backlog = 1024
  • 40. OS Tuning Parameters (cont.)
      • Virtual Memory
          vm.oom_kill_allocating_task = 1
          vm.max_map_count = 200000
          vm.swappiness = 1
          vm.dirty_writeback_centisecs = 500
          vm.dirty_expire_centisecs = 500
          vm.dirty_ratio = 60
          vm.dirty_background_ratio = 5
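To check whether a host actually carries these settings, the parameters can be parsed and compared with the live values under `/proc/sys`, where each dotted sysctl key maps to a path with the dots replaced by slashes. The sketch below assumes that layout (Linux) and takes the reader as a function argument so it can be exercised without touching a real host; all helper names are hypothetical.

```python
def parse_sysctl(text):
    """Parse `key = value` lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Collapse runs of whitespace so multi-value settings compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a sysctl key to its file under /proc/sys."""
    return "/proc/sys/" + key.replace(".", "/")


def drift(desired, read_value):
    """Return {key: (wanted, found)} for every setting that differs.

    `read_value` is any callable that returns the file contents for a path.
    """
    differences = {}
    for key, wanted in desired.items():
        found = " ".join(read_value(proc_path(key)).split())
        if found != wanted:
            differences[key] = (wanted, found)
    return differences


DESIRED = parse_sysctl("""
vm.swappiness = 1
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5
net.ipv4.tcp_rmem = 4096 87380 4194304
""")

# Stand-in for reading the real files: a host left at a different swappiness.
FAKE_HOST = {
    "/proc/sys/vm/swappiness": "60\n",
    "/proc/sys/vm/dirty_ratio": "60\n",
    "/proc/sys/vm/dirty_background_ratio": "5\n",
    "/proc/sys/net/ipv4/tcp_rmem": "4096\t87380\t4194304\n",
}

print(drift(DESIRED, FAKE_HOST.__getitem__))
```

On a real host, `read_value` could be `lambda path: open(path).read()`.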
  • 41. Kafka Broker Sensors
          kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
          kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
          kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
          kafka.server:name=PartitionCount,type=ReplicaManager
          kafka.server:name=LeaderCount,type=ReplicaManager
          kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
          kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
          kafka.controller:name=ActiveControllerCount,type=KafkaController
          kafka.controller:name=OfflinePartitionsCount,type=KafkaController
          kafka.log:name=max-dirty-percent,type=LogCleanerManager
          kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
          kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
          kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
          kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
          kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
          kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
          kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
          kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
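Sensor names like these follow the JMX object-name shape `domain:key=value,key=value`. When feeding them to a metrics collector it can help to split them into parts, for example to group sensors by domain or to spot the ones that carry a `request=*` wildcard. The parser below is a minimal sketch that handles only this simple shape (no quoting or escaping), and its function names are hypothetical.

```python
def parse_object_name(name):
    """Split 'domain:k1=v1,k2=v2' into (domain, {k1: v1, k2: v2})."""
    domain, _, props = name.partition(":")
    properties = dict(pair.split("=", 1) for pair in props.split(","))
    return domain, properties


def group_by_domain(names):
    """Return {domain: [metric name, ...]} preserving input order."""
    grouped = {}
    for name in names:
        domain, properties = parse_object_name(name)
        grouped.setdefault(domain, []).append(properties["name"])
    return grouped


SENSORS = [
    "kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager",
    "kafka.controller:name=OfflinePartitionsCount,type=KafkaController",
    "kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics",
    "kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics",
]

print(group_by_domain(SENSORS))
```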
  • 42. Kafka Broker Sensors - Topics
          kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
          kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
          kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
          kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
          kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
          kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
          kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
          kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*