Kafka Summit SF 2017 - Running Kafka for Maximum Pain

Running Kafka
For Maximum Pain
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn

To All The Tech Debt
I’ve Loved Before

Technical Debt
The cost of the rework required by choosing an easy solution now.

SRE ❤s SWE
• Both roles are critical
• Work together to balance operability and features
• SRE’s job is to enable SWE to move as quickly as possible while meeting SLOs

How Big?
2 Trillion Messages
• Produced
• Every day
5 Gbps Inbound
• Single cluster
• Unique data
18 Gbps Outbound
• Average 3x consumption
• Before mirroring
2.5M Partitions
• Largest clusters are 250k
• Up to 10k partitions per broker

Sources of Pain
Multitenancy
Exponentially increase your problems by sharing them
Infrastructure
Kafka’s great! Everything else around it sucks
Management
What do you mean I have to do it myself?

Sharing is Caring
• Reduces the hardware footprint
• Less administrative overhead
• One bad actor makes everyone’s life hard

Types of Data
Tracking
• Member-related activity
• Data schemas are managed by DMRC
• Aggregated to some datacenters
Metrics
• Application metrics, service calls, logs
• Mostly produced by application containers
• Only aggregated to backend datacenters
Queuing
• Internal application data, messaging
• Largest users are Samza and Search
• Limited aggregation in production only
Logging
• Dedicated cluster for application logs going to ELK
• High volume, low retention
• Not aggregated

Multitenancy Woes
Ownership
• Auto topic creation means nobody knows who created it
• Multiple producers further cloud the issue
• Who makes decisions?
• Who is responsible for problems?
Capacity
• No controls means it’s free!
• Getting one person to project growth is hard
• Getting 100 people to do it is impossible
• Storage hardware is not commodity
Security
• Started with zero security
• Impossible to handle sensitive data

Improvements
Ownership
• Added an ownership metadata service
• One committee with control over shared data schemas
• Moving to disable automatic topic creation
Capacity
• Quotas to limit bandwidth
• Retention by both time and bytes to restrict disk usage
• Also forces customers to talk to us about data usage
Security
• Move all clients to SSL
• Add ACLs for existing usage (after review)
• Starting to evaluate encryption
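
To illustrate how retention by both time and bytes bounds disk usage, here is a minimal Python sketch. It is a simplified model, not the broker's actual code: the `Segment` class and `segments_to_delete` function are hypothetical, and the parameters only mirror the meaning of Kafka's `retention.ms` and `retention.bytes` topic settings. The oldest closed segments are dropped while either limit is exceeded, and the active (last) segment is never removed.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    size_bytes: int
    newest_timestamp_ms: int  # timestamp of the newest message in the segment


def segments_to_delete(segments, now_ms, retention_ms, retention_bytes):
    """Return the oldest segments that the time or size limit would remove.

    `segments` is ordered oldest first; the last entry is treated as the
    active segment and is never deleted in this simplified model.
    """
    total = sum(s.size_bytes for s in segments)
    doomed = []
    for seg in segments[:-1]:
        too_old = now_ms - seg.newest_timestamp_ms > retention_ms
        # Delete for size only if the partition still holds at least
        # retention_bytes after this segment is gone.
        over_size = total - seg.size_bytes >= retention_bytes
        if not (too_old or over_size):
            break
        doomed.append(seg)
        total -= seg.size_bytes
    return doomed


if __name__ == "__main__":
    log = [Segment(100, ts) for ts in (0, 1000, 2000, 3000, 4000)]
    removed = segments_to_delete(log, now_ms=5000, retention_ms=3500, retention_bytes=300)
    print(len(removed), "segments removed")
```

With both limits set, whichever one is hit first wins, so a topic that suddenly grows in volume cannot fill the disk just because its time limit is generous.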

Mirror Maker
• Every change requires a restart
• Grows n² with number of sites
• Inefficient since 0.8
• Loses key to partition affinity

Mirror Maker Performance
• Added identity handler for fixed partition mapping
• Eliminated compression
• Finally off old consumer
• Coming soon to a KIP near you
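
The idea of a fixed partition mapping can be sketched as follows. This is an illustration only, not LinkedIn's handler: `Record`, `rehash_partition`, and `identity_partition` are hypothetical names, and CRC32 stands in for a hash-based partitioner (Kafka's default partitioner uses murmur2). If the original producer chose partitions with its own logic, re-hashing the key on the mirror side can land the message elsewhere, while an identity mapping keeps the source partition number.

```python
import zlib
from typing import NamedTuple


class Record(NamedTuple):
    key: bytes
    partition: int  # partition the record occupied in the source cluster


def rehash_partition(record: Record, num_partitions: int) -> int:
    """Pick a destination partition by hashing the key (CRC32 as a stand-in)."""
    return zlib.crc32(record.key) % num_partitions


def identity_partition(record: Record, num_partitions: int) -> int:
    """Keep the source partition number so key-to-partition affinity survives."""
    if record.partition >= num_partitions:
        raise ValueError("destination topic has fewer partitions than the source")
    return record.partition


if __name__ == "__main__":
    rec = Record(key=b"member-42", partition=7)
    print("rehash:", rehash_partition(rec, 8), "identity:", identity_partition(rec, 8))
```

The identity mapping only works when the destination topic has at least as many partitions as the source, which is why the sketch rejects a smaller destination.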

Message Auditing
• Required to assure mirroring works
• Makes infrastructure care about data schema
• Only tracks producers (mostly)
• Relational database doesn’t cut it for storing audit data
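
One way to picture count-based auditing is to bucket messages by topic and time window at each tier and compare the totals. The Python sketch below is an assumption about the general approach, not the actual audit system: the 10-minute bucket size, the function names, and the tier labels are all hypothetical.

```python
from collections import Counter

BUCKET_MS = 10 * 60 * 1000  # hypothetical 10-minute audit window


def count_by_bucket(messages):
    """Count messages per (topic, bucket start); messages are (topic, timestamp_ms)."""
    counts = Counter()
    for topic, timestamp_ms in messages:
        counts[(topic, timestamp_ms - timestamp_ms % BUCKET_MS)] += 1
    return counts


def missing_counts(produced, mirrored):
    """Return buckets where the mirrored tier saw fewer messages than were produced."""
    return {
        bucket: count - mirrored.get(bucket, 0)
        for bucket, count in produced.items()
        if mirrored.get(bucket, 0) < count
    }


if __name__ == "__main__":
    produced = count_by_bucket([("page-views", 0), ("page-views", 1), ("page-views", 600_000)])
    mirrored = count_by_bucket([("page-views", 0), ("page-views", 600_000)])
    print(missing_counts(produced, mirrored))
```

Because the bucket is derived from a timestamp inside each message, the audit layer has to read that field, which is how the infrastructure ends up caring about the data schema.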

Streaming Audit
• Moving audit data to headers
• Utilizing Samza for processing counts
• Adding “cost to serve” information

Topic Configuration
• No way to manage configs across multiple clusters
• Creating a new datacenter is a manual process
• Changes need to be propagated in a specific order
• Administrative commands are not protected
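
A minimal sketch of what cross-cluster config management has to detect: given one desired set of topic configs, report where each cluster has drifted. The function name and the cluster names are hypothetical; `retention.ms` and `retention.bytes` are real Kafka topic settings used here only as sample keys. The sketch deliberately does not model the propagation order, which the slide does not specify.

```python
def config_drift(desired, clusters):
    """Compare each cluster's topic configs with the desired values.

    desired:  {config_key: value}
    clusters: {cluster_name: {config_key: value}}
    Returns {cluster_name: {config_key: (actual, desired)}} for mismatches only.
    """
    drift = {}
    for name, actual in clusters.items():
        diff = {
            key: (actual.get(key), value)
            for key, value in desired.items()
            if actual.get(key) != value
        }
        if diff:
            drift[name] = diff
    return drift


if __name__ == "__main__":
    desired = {"retention.ms": "345600000", "retention.bytes": "1073741824"}
    clusters = {
        "local-dc1": {"retention.ms": "345600000", "retention.bytes": "1073741824"},
        "aggregate-dc1": {"retention.ms": "86400000"},
    }
    print(config_drift(desired, clusters))
```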

Nuage
• One-stop shop for Data Infrastructure
• Allows creation of topics with ownership and ACLs
• Uses our Kafka REST interface for CRUD

Cluster Membership
• No tool to remove brokers
• New brokers take no traffic
• Partition reassignment is basic
• Automatic leader election kills the cluster

Round 1: kafka-tools
• kafka-assigner:
  • Remove broker
  • Rebalance replicas
  • Fix replication factor
• Protocol CLI tool
• Adding an admin client
• github.com/linkedin/kafka-tools
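
As an illustration of what a "remove broker" operation involves, the Python sketch below moves every replica off one broker onto the least-loaded remaining broker and emits a plan in the JSON layout that Kafka's stock `kafka-reassign-partitions` tool reads. This is not kafka-assigner's algorithm; the function name, the greedy least-loaded choice, and the sample assignment are assumptions made for the example.

```python
import json


def remove_broker(assignment, broker_id, brokers):
    """Build a reassignment plan that takes `broker_id` out of every replica list.

    assignment: {(topic, partition): [replica broker ids]}
    brokers:    all broker ids in the cluster, including the one being removed
    """
    remaining = [b for b in brokers if b != broker_id]
    load = {b: 0 for b in remaining}
    for replicas in assignment.values():
        for b in replicas:
            if b in load:
                load[b] += 1

    moves = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if broker_id not in replicas:
            continue
        # A broker may hold at most one replica of a given partition.
        candidates = [b for b in remaining if b not in replicas]
        if not candidates:
            raise ValueError(f"no broker available for {topic}-{partition}")
        target = min(candidates, key=lambda b: (load[b], b))
        load[target] += 1
        new_replicas = [target if b == broker_id else b for b in replicas]
        moves.append({"topic": topic, "partition": partition, "replicas": new_replicas})
    return {"version": 1, "partitions": moves}


if __name__ == "__main__":
    current = {("events", 0): [1, 2], ("events", 1): [2, 3], ("events", 2): [3, 1]}
    print(json.dumps(remove_broker(current, broker_id=1, brokers=[1, 2, 3, 4]), indent=2))
```

Replacing the broker in place keeps the replica order, so a partition whose first replica was on the removed broker gets its new preferred leader on the replacement.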

Round 2: Cruise Control
• Dynamic workload rebalancing
• Self-healing clusters
• Manages multiple goals (network, disk, CPU, rack)
• Requires no additional code
• Open source now!

What Needs Attention?
Log Compaction
• Very few metrics
• One bad partition breaks it
Client Config
• Client and broker cannot negotiate
• Configurations are essentially shared secrets
• No information on the version of clients connecting
Upgrading
• Message format changes are still troubling
• Broker upgrades must be carefully ordered
• Often no clear way to roll back
Make It Easier
LinkedIn Open Source
• Cruise Control
  • https://github.com/linkedin/cruise-control
• Kafka Monitor
  • https://github.com/linkedin/kafka-monitor
• Burrow
  • https://github.com/linkedin/Burrow
• kafka-tools
  • https://github.com/linkedin/kafka-tools
Get Involved
• Community
  • users@kafka.apache.org
  • dev@kafka.apache.org
• Bugs and Work:
  • https://issues.apache.org/jira/projects/KAFKA
