
Twitter's Apache Kafka® Adoption Journey


Ming Liu, Twitter, Tech Lead + Yiming Zang, Twitter, Senior Software Engineer

Until recently, the Messaging team at Twitter had been running an in-house pub/sub system, EventBus (built on top of Apache DistributedLog and Apache BookKeeper, and similar in architecture to Apache Pulsar), to cater to our pub/sub needs. In 2018 we made the decision to move to Apache Kafka, migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today: Kafka is now an essential piece of Twitter's infrastructure and processes over 150M messages per second. In this meetup talk, we will share the learnings and challenges from our journey moving to Apache Kafka.

https://www.meetup.com/KafkaBayArea/events/272643868/



  1. Twitter's Apache Kafka Adoption Journey (September 2nd, 2020), Ming Liu and Yiming Zang
  2. Overview
     ● 01 Kafka Adoption: why we adopted Kafka; the Kafka migration story at Twitter
     ● 02 Challenges & Experience: challenges we have faced; experience and success stories
     ● 03 Kafka Ecosystem at Twitter: the ecosystem we built around Kafka at Twitter
  3. EventBus
  4. Why Kafka ● Open source ecosystem ● Lower cost ● Features
  5. Self-Serve Migration (diagram): client, EventBus provisioning service, Kafka provisioning service, replication, dynamic switch, migration tool
  6. Migration Challenges
     ● Migration scale: thousands of topics, tens of thousands of consumers
     ● Mission critical: revenue impact, product impact
     ● Cooperation: hundreds of teams, customer support after migration
  7. Challenges and experience with Kafka
     ● Overall, a very successful experience migrating to Kafka with diverse, high-throughput workloads
     ● This talk focuses mainly on the challenges we faced and shares our learnings
     ● Assumes a basic understanding of, and operational knowledge of, Kafka
  8. Kafka today at Twitter (Sep 2020)
     ● Cluster: on-prem, bare metal, powered by Aurora/Mesos; will move some workloads to public cloud in the future
     ● Broker config: 4TB SSD, with 24TB SSD support coming soon; JVM heap 5GB to 15GB, page cache around 45GB currently
     ● Scale: 80 clusters, up to 200 brokers per cluster; >2,000 topics, >40k subscribers; 160M EPS, 160G BPS ingress, 900G BPS egress
     ● Version: broker 2.5; client/Streams 2.2 (upgrade in progress)
  9. Latency sensitivity and tuning
     ● Architecture: shared storage and serving layer; durable by replication
     ● Deployment: automatically configure topic throttle rates before each broker deploy
     ● Noisy neighbors: quota management (an illustrative sketch follows)
     ● Traffic fluctuation: acks=all; catch-up reads or high fanout can affect produce latency
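     As one illustration of quota management for noisy neighbors: newer Kafka clients (2.6+, later than the versions quoted in this deck) expose an AdminClient API for setting client quotas, while older deployments typically use the kafka-configs tool. This is a minimal sketch only; the broker address, client id, and rate are placeholders, not Twitter's values.

        import java.util.Collections;
        import java.util.Properties;
        import org.apache.kafka.clients.admin.Admin;
        import org.apache.kafka.clients.admin.AdminClientConfig;
        import org.apache.kafka.common.quota.ClientQuotaAlteration;
        import org.apache.kafka.common.quota.ClientQuotaEntity;

        public class ClientQuotaSketch {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address

                try (Admin admin = Admin.create(props)) {
                    // Throttle a specific (placeholder) client id to ~10 MB/s of produce traffic.
                    ClientQuotaEntity entity = new ClientQuotaEntity(
                            Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "noisy-client"));
                    ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                            entity,
                            Collections.singletonList(
                                    new ClientQuotaAlteration.Op("producer_byte_rate", 10_000_000.0)));
                    admin.alterClientQuotas(Collections.singleton(alteration)).all().get();
                }
            }
        }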
  10. Use cases: high fan-in / high fan-out
     ● High fan-in: thousands of publishers; batch efficiency by publishing to a limited set of partitions (KIP-480 sticky partitioner), as sketched below
     ● High fan-out: thousands of subscribers (example: ML training); egress close to NIC speed (10 Gbit/s); higher produce and fetch latency
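     A minimal sketch of the KIP-480 sticky-partitioning behavior referenced above, assuming Kafka clients 2.4 or newer; the broker address and topic name are placeholders, not Twitter's.

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerConfig;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;

        public class StickyPartitioningSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
                props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                // Since Kafka 2.4 (KIP-480) the default partitioner already batches keyless records
                // onto one "sticky" partition per batch; UniformStickyPartitioner applies the same
                // behavior regardless of the record key.
                props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
                        "org.apache.kafka.clients.producer.UniformStickyPartitioner");

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // No key: the sticky partitioner fills a batch for one partition before moving on,
                    // which improves batching efficiency for topics with thousands of publishers.
                    producer.send(new ProducerRecord<>("example-topic", "event-payload")); // placeholder topic
                }
            }
        }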
  11. Latency learnings
     ● Many variables affect latency (a client-side tuning sketch follows):
       ○ Producer config: linger.ms, batch.size, compression.type
       ○ Follower config: replica.fetch.min.bytes, replica.fetch.wait.max.ms, num.replica.fetchers
       ○ Consumer config: fetch.min.bytes, fetch.max.wait.ms
       ○ Connections, partitions, clients
     ● Network thread optimization:
       ○ Round-robin connection assignment so that utilization across network processors is balanced
       ○ Avoid IO-blocking calls from the network thread (KAFKA-7504)
       ○ KIP idea: separate producer/follower fetch from consumer fetch
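     An illustrative client-side view of the producer and consumer knobs listed above; the values are arbitrary examples for discussion, not Twitter's settings, and the follower-side knobs are broker configs so they are not shown here.

        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerConfig;
        import org.apache.kafka.clients.producer.ProducerConfig;

        public class LatencyTuningSketch {
            // Producer side: small linger/batch values favor latency, larger values favor throughput.
            static Properties producerTuning() {
                Properties p = new Properties();
                p.put(ProducerConfig.LINGER_MS_CONFIG, "5");          // wait up to 5 ms to fill a batch
                p.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");     // max batch size in bytes
                p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // cheaper CPU than gzip
                return p;
            }

            // Consumer side: fetch.min.bytes / fetch.max.wait.ms trade end-to-end latency
            // for fewer, larger fetch responses.
            static Properties consumerTuning() {
                Properties c = new Properties();
                c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");
                c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "100");
                return c;
            }
        }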
  12. Kafka network IO (diagram): an acceptor thread accepts client connections (OP_ACCEPT) and hands socket channels to processor (network) threads, which queue client requests for the request handler threads.
  13. Idempotent and transactional producers
     ● Idempotent producer:
       ○ Learning: broker-side produce failures can trigger OOM on the broker and cause the broker to fail to start, because the broker caches all ProducerStateEntry objects for up to the retention time
       ○ Handle OutOfOrderSequenceException properly (sketch below)
     ● Transactional producer (and exactly-once in Kafka Streams):
       ○ More stable over the last 2 years
       ○ We had to fix a few bugs and upstream them; we still run into occasional problems (like orphaned transactions)
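     A hedged sketch of an idempotent, transactional producer and the OutOfOrderSequenceException handling mentioned above; the transactional id, topic, and broker address are placeholders, and production code would also handle fencing and retries more carefully.

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerConfig;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.errors.OutOfOrderSequenceException;
        import org.apache.kafka.common.serialization.StringSerializer;

        public class TransactionalProducerSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder address
                props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");         // idempotent writes
                props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id"); // placeholder id

                KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                producer.initTransactions();
                try {
                    producer.beginTransaction();
                    producer.send(new ProducerRecord<>("example-topic", "key", "value")); // placeholder topic
                    producer.commitTransaction();
                } catch (OutOfOrderSequenceException fatal) {
                    // The broker saw a sequence gap: fatal for this producer instance.
                    // Close and recreate the producer instead of retrying blindly.
                    producer.close();
                } catch (Exception retriable) {
                    // Most other failures can be handled by aborting the transaction and retrying.
                    producer.abortTransaction();
                }
            }
        }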
  14. Kafka Streams
     ● Open-source Finatra: native Kafka Streams integration
       ○ Allows fast creation of a fully functional Kafka Streams service
       ○ Supports a custom DSL and async processors and transformers
       ○ Rich unit-testing functionality
       ○ RocksDB integration supporting efficient range scans
     ● One example stateful workload: 4.5G BPS, 2.5M EPS ingress (Twitter blog)
     ● Limitations (improved in 2.5):
       ○ Scalability: the size of the stream assignor's group message limits the maximum number of group members
       ○ Stateful: we had to develop our own static assignment to ensure the performance of stateful workloads
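     Twitter's integration goes through Finatra, but a minimal plain Kafka Streams topology, with placeholder application id, broker address, and topic names, looks roughly like this.

        import java.util.Properties;
        import org.apache.kafka.common.serialization.Serdes;
        import org.apache.kafka.streams.KafkaStreams;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.StreamsConfig;
        import org.apache.kafka.streams.kstream.KStream;

        public class StreamsTopologySketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-stream-app"); // placeholder
                props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");    // placeholder
                props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
                props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

                StreamsBuilder builder = new StreamsBuilder();
                KStream<String, String> events = builder.stream("input-topic");       // placeholder topic
                // A trivial stateless transform; Finatra's integration wraps topologies like this
                // in a Twitter server with DSL helpers, async processors, and test utilities.
                events.mapValues(value -> value.toUpperCase())
                      .to("output-topic");                                            // placeholder topic

                KafkaStreams streams = new KafkaStreams(builder.build(), props);
                streams.start();
                Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
            }
        }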
  15. Kerberos authentication and SSL encryption
     ● Authentication: SASL/GSSAPI (Kerberos); requires the keytab to be available during connection setup (client config sketch below)
     ● Encryption: SSL; the performance impact is noticeable, especially on JDK 8
     ● Authorization: plugin (to be implemented)
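     A sketch of the client-side security settings described above (SASL/GSSAPI over SSL); the keytab path, principal, and truststore details are placeholders.

        import java.util.Properties;
        import org.apache.kafka.clients.CommonClientConfigs;
        import org.apache.kafka.common.config.SaslConfigs;
        import org.apache.kafka.common.config.SslConfigs;

        public class SecureClientConfigSketch {
            static Properties secureClientProps() {
                Properties props = new Properties();
                // Kerberos authentication over an encrypted channel.
                props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
                props.put(SaslConfigs.SASL_MECHANISM, "GSSAPI");
                props.put(SaslConfigs.SASL_KERBEROS_SERVICE_NAME, "kafka");
                // The keytab must be readable when the connection is established.
                props.put(SaslConfigs.SASL_JAAS_CONFIG,
                        "com.sun.security.auth.module.Krb5LoginModule required "
                        + "useKeyTab=true storeKey=true "
                        + "keyTab=\"/path/to/client.keytab\" "       // placeholder path
                        + "principal=\"client@EXAMPLE.COM\";");      // placeholder principal
                // Truststore used to verify broker certificates.
                props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/path/to/truststore.jks"); // placeholder
                props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");                // placeholder
                return props;
            }
        }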
  16. Disk failure and unclean leader election
     ● Disk failures can sometimes cause offline partitions and data loss
     ● In Kafka 2.5 this happens less frequently, as the default replica.lag.time.max.ms was raised to 30s
     ● KIP-501: avoid offline partitions
  17. Replica rebalance
     ● Cluster expansion and disk rebalancing are very slow processes; plan ahead
     ● We implemented a Kafka auto-rebalance service
     ● Rebalancing between disks:
       ○ JBOD is better than RAID0 for disk latency
       ○ There are currently problems moving replicas between disks (should be fixed in 2.7)
     ● KIP-405: tiered storage
  18. Kafka ecosystem at Twitter (diagram): Kafka self-serve and provisioning service for customer onboarding; producers fed by Scribe/Flume; consumers including Kafka Streams and Heron for real-time processing and an HDFS replicator into Hadoop for batch processing; cross-DC/cluster replicators and a filtering service; a Kafka auto-balancer that kicks off rebalances after cluster expansion or shrink; and an observer for external monitoring.
  19. Thank you!
