Vahid Hashemian and Ambud Sharma from Pinterest discuss Kafka operations at their company. Pinterest runs 50+ production Kafka clusters with 2,500+ brokers, handling peak inbound traffic of about 20 GB/s and peak outbound traffic of about 50 GB/s. They faced challenges around performance, cost control, and partition placement and balancing. To address these, Pinterest developed automation such as brokersets, topic wizards, and Orion to manage clusters and topics, upgraded all clusters to Kafka 2.3.1+, and learned lessons about vetting upgrade versions and client compatibility. Going forward, they aim to improve interoperability, scaling, efficiency, and reliability.
Organic Growth and Good Sleep: Effective Kafka Operations at Pinterest
1. Organic Growth and a Good Night Sleep: Effective Kafka Operations at Pinterest
Kafka Summit 2020, August 24, 2020
Vahid Hashemian, Ambud Sharma
2. Agenda
1 Who We Are
2 What is Pinterest
3 Kafka at Pinterest
4 Challenges/Lessons Learned
5 Automation
6 The Upgrade Story
7 The Road Ahead
3. 1 Who we are
Vahid Hashemian
Software Engineer
Logging Platform, Pinterest
Committer, PMC Member
Ambud Sharma
EM & TL
Logging Platform, Pinterest
And other members of Logging Platform team:
Eric Lopez, Heng Zhang, Henry Cai, Jeff Xiang, Ping-Min Lin
6. ● 367 million Global Monthly Active Users
● 240 billion Pins saved
● 5B+ boards
● 82% of all Pinners use Pinterest on mobile
● 91% of Pinners say that Pinterest is filled with positivity.
● 89% of Pinners report that they leave the site feeling empowered.
2 What is Pinterest
8. Scale of Data Ingestion Pipelines:
● Hosted on AWS EC2
● 50+ Kafka clusters (in production)
● 2,500+ Kafka brokers
● 3,000+ Kafka topics, 150K+ Kafka partitions
● Inbound traffic: ~ 20 GB/s (max)
● Outbound traffic: ~ 50 GB/s (max)
Current Version:
● 2.3.1 with cherry-picked commits
3 Kafka at Pinterest
9. ● Performance Issues
○ Magnetic disks to NVMe SSDs
○ Dynamic rebalancing -> Static rebalancing
○ Message conversions (implicit throttling)
● Cost control
○ Data transfer -> Rack awareness
○ Compressed topics
○ Retentions and Replication factor tuning
● Placement and auto-balancing partitions (brokersets)
○ New topics
○ Traffic pattern changes
4 Challenges / Lessons Learned
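The cost-control bullets above (compressed topics, retention and replication factor tuning) all act on the same back-of-envelope quantity: how many bytes the cluster has to keep on disk. A minimal sketch of that arithmetic follows. The function name, the parameters, and the example figures are illustrative assumptions, not Pinterest's numbers or tooling, and the estimate ignores index files and segment-rolling granularity.

```python
import math


def retained_bytes(ingest_bytes_per_sec, retention_hours,
                   replication_factor, compression_ratio=1.0):
    """Approximate steady-state bytes on disk across a cluster.

    compression_ratio is uncompressed size divided by compressed size
    (1.0 means no compression). All names here are hypothetical.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    retention_seconds = retention_hours * 3600
    raw = ingest_bytes_per_sec * retention_seconds * replication_factor
    return raw / compression_ratio


# Hypothetical topic: 1 GB/s ingest, 24 h retention, replication factor 3.
baseline = retained_bytes(1e9, 24, 3)

# Same topic with replication factor 2 and 2x compression.
tuned = retained_bytes(1e9, 24, 2, compression_ratio=2.0)

print(f"baseline: {baseline / 1e12:.1f} TB, tuned: {tuned / 1e12:.1f} TB")
```

Because the three knobs multiply, trimming each a little can reduce the footprint more than a large change to any single one.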
10. 5 Automation
[Diagram: Wizards, Git Repo, Orion; Cluster X containing Brokerset 1, Brokerset 2, and Brokerset 3, hosting Topic 1 through Topic 6]
● Each brokerset is a group of 1 or more brokers
● Brokerset is defined using broker id ranges
● Brokersets can overlap
● Topic partition counts are a multiple of brokersets
● Currently there are 2 types of brokersets (Capacity & Static)
● Topics are assigned based on capacity requirements & capacity available
● All topics are managed via config check-in (infra as code)
● Wizard provides a simple interface for users to generate topic code
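The brokerset rules above can be sketched in a few lines. This is an illustration under stated assumptions, not Pinterest's implementation: the `Brokerset` class, the `assign_leaders` helper, and the round-robin placement are hypothetical, and "a multiple of brokersets" is read here as "a multiple of the number of brokers in the brokerset".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brokerset:
    """A named group of brokers defined by inclusive broker id ranges."""
    name: str
    id_ranges: tuple  # e.g. ((1, 3), (7, 8)) means brokers 1-3 and 7-8

    def broker_ids(self):
        ids = []
        for start, end in self.id_ranges:
            ids.extend(range(start, end + 1))
        return ids


def assign_leaders(brokerset, partition_count):
    """Spread partitions round-robin over a brokerset (assumed policy).

    Requiring partition_count to be a multiple of the brokerset size
    means every broker ends up with the same number of partitions.
    """
    brokers = brokerset.broker_ids()
    if partition_count % len(brokers) != 0:
        raise ValueError(
            f"{partition_count} partitions is not a multiple of "
            f"{len(brokers)} brokers in {brokerset.name}"
        )
    return {p: brokers[p % len(brokers)] for p in range(partition_count)}


# Hypothetical brokerset spanning two id ranges (5 brokers in total).
capacity_a = Brokerset("capacity_a", ((1, 3), (7, 8)))
placement = assign_leaders(capacity_a, 10)
print(placement)
```

Keeping the definition as plain data (a name plus id ranges) is what makes it practical to check brokersets and topics into a repository, as the infra-as-code bullet describes.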
12. 5 Automation
We developed Orion as a Unified Management System for Stateful Distributed Systems
1. UI
2. Automation
3. Cluster Management
4. Alerting
Orion manages all Kafka clusters at Pinterest.
16. ● All clusters were upgraded earlier this year to 2.3.1+
● Post-upgrade versions:
○ Broker: 2.3.1+ [2.3.1 + cherry-picked commits]
○ Inter broker protocol: 2.3-IV1
○ Log message format: 2.3-IV1
6 The Upgrade Story
17. ● Broker version
○ The version of the Kafka binary the broker runs.
● Inter broker protocol version
○ The protocol version that brokers use to communicate with each other.
○ Maximum allowed value is the minimum broker version in the cluster.
○ Once upgraded, cannot be rolled back.
● Log message format version
○ The format version that brokers use to store messages.
○ Can be set per topic.
○ Guarantees that messages on disk have a format version lower than or equal to the version configured for the topic.
○ If set incorrectly, could break old consumers (those prior to 0.10.2, which are not forward compatible).
6 The Upgrade Story
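Two of the rules above lend themselves to a quick pre-upgrade check: the inter broker protocol version cannot exceed the lowest broker version in the cluster, and consumers older than 0.10.2 are the ones at risk from a newer log message format. The sketch below is a hypothetical helper, not part of Kafka or of Pinterest's tooling. It assumes plain dotted version strings and that protocol versions use three components for 0.x releases (such as 0.10.2) and two for 1.0 and later (such as 2.3).

```python
def parse_version(text):
    """Turn '2.3.1' into (2, 3, 1) and '2.3-IV1' into (2, 3)."""
    core = text.split("-")[0]
    return tuple(int(part) for part in core.split("."))


def max_inter_broker_protocol(broker_versions):
    """Highest protocol version every broker in the cluster can speak."""
    lowest = min(parse_version(v) for v in broker_versions)
    components = 3 if lowest[0] == 0 else 2
    return lowest[:components]


def consumers_at_risk(client_versions, threshold="0.10.2"):
    """Client versions older than the threshold named on the slide."""
    limit = parse_version(threshold)
    return [v for v in client_versions if parse_version(v) < limit]


# Hypothetical cluster in the middle of a rolling upgrade.
brokers = ["2.3.1", "2.1.0", "2.3.1"]
print(max_inter_broker_protocol(brokers))

# Hypothetical client inventory.
clients = ["0.9.0.1", "0.10.2.0", "2.3.1"]
print(consumers_at_risk(clients))
```

In this example the protocol version has to stay at 2.1 until the last broker is running 2.3.1, and the 0.9.0.1 consumer should be upgraded or kept on a topic with an older message format before the format version is raised.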
19. Worthy lessons to share
1. Not every release of Apache Kafka may work for your specific use cases. Do your due diligence before choosing the version to upgrade to.
6 The Upgrade Story
20. Worthy lessons to share
2. Check all blocker / critical / major bugs reported against your upgrade version of choice.
6 The Upgrade Story
21. 6 The Upgrade Story
Worthy lessons to share
3. Bug fix releases are usually safer options for upgrade, but the risk is never zero.
22. Worthy lessons to share
4. If you have clusters on old log message format versions, make sure their clients would not be affected by the upgrade.
Client libraries covered (slides 22 to 25, one each): Scala / Java, librdkafka, kafka-python, confluent-kafka-python
6 The Upgrade Story
26. ● Interoperability
○ Abstract implementation (Kafka internals etc.) from client applications.
● Scaling and efficiency
○ Provide true horizontal and dynamic scaling.
○ Reduce cost per GB moved.
● Reliability
○ Automate on-call support
○ Measure & pinpoint data loss
7 The Road Ahead
27. Pinterest Engineering Blog (on Medium)
● Using graph algorithms to optimize Kafka operations (Part 1, Part 2)
● Optimizing Kafka for the cloud
● Open sourcing Singer, Pinterest’s performant and reliable logging agent
● How Pinterest runs Kafka at scale
Pinterest is hiring
● https://careers.pinterest.com/
Additional Resources