Streaming in Practice - Putting Apache Kafka in Production

This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production

  • I’m an engineer at Confluent. In a previous job, I’ve taken Kafka from proof of concept all the way to production with some pipelines handling more than 5B events per day. My goal is to share what I think are the most important things to know when taking Kafka to production.

    This is the last talk in a series of 6. The previous talks cover components of the Kafka ecosystem and stream processing in general. This talk is about taking Kafka to production. In general, I think Kafka is pretty easy to operate and has great documentation compared to other technologies I’ve worked with. Since we cannot cover everything, I want to focus on the important concepts and hopefully give you enough insight that you know what to plan for and where to find more information.

    Patterns for integrating with existing data systems and applications – covered in a previous talk
    Metadata management at enterprise scale – I’ll include a link at the end to a great blog post by Gwen Shapira
  • - Review the basics
    Talk about tuning Kafka – what tradeoffs you can make
    Data balancing
    Spanning Multiple Datacenters
  • First, I want to review a few basics of Kafka to make sure we have enough context for the rest of the talk. This will be a quick review if you’ve seen other talks in the series. If not, that’s ok. You should be able to follow along.
  • Kafka is a streaming platform. A streaming platform can be THE common point of data integration across an organization. It allows the teams and systems within the organization to share data in real time and react as fast as necessary. It allows teams to work together without tight coupling of their services.

    Kafka has some key characteristics that make it well-suited to being a streaming platform:
    First, it scales well and cheaply; it is very efficient. You can do hundreds of MB/sec of writes per server and can have many servers.
    Kafka doesn’t get slower as you store more data in it – this is a huge win if you’ve operated other data systems.
    Distributed by design – replication, fault tolerance, partitioning, elastic scaling.
    Strong guarantees around ordering and durability.
    It has some unique features, such as compacted topics, that let it handle unusual use cases.
    It has enterprise features like fine-grained security controls.
  • Brokers hold data
    Topics are logical streams that are broken into partitions
    Partitions are the unit of parallelism for consumers
    Topic-partitions are replicated and spread over multiple brokers
    Producers write to brokers and consumers read from them
    Kafka relies on ZooKeeper for its own internal cluster management
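    To make these pieces concrete, here is a minimal sketch of a Java producer and consumer using the standard Kafka clients; the broker addresses, topic name, and group id are made-up placeholders, not anything from the talk.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Collections;
    import java.util.Properties;

    public class QuickStart {
      public static void main(String[] args) {
        // Producer: writes records to a topic; the record key determines the partition.
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092,broker2:9092");  // placeholder brokers
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
          producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\":\"/checkout\"}"));
        }

        // Consumer: consumers in the same group split the topic's partitions among themselves.
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092,broker2:9092");
        c.put("group.id", "recommendation-loader");                // placeholder group
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
          consumer.subscribe(Collections.singletonList("page-views"));
          ConsumerRecords<String, String> records = consumer.poll(1000);
          for (ConsumerRecord<String, String> r : records) {
            System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
          }
        }
      }
    }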

  • Deployment is pretty easy. There are only a few components, and they run as JVM processes.
    Rolling upgrades mean no downtime
  • In distributed systems, there are tradeoffs. The goal of this section is to highlight the tradeoffs you can make to tune Kafka to match your application's priorities and get the most out of it.

  • You can imagine that we have a demand model to match supply with demand while keeping our inventory as low as possible
  • What knobs do we have in Kafka to match these priorities?
  • We’re going to look at each of these and how to apply them to our example applications
  • Resources vs. Throughput
  • F replicas can tolerate F-1 failures
    Topic1 has 3 replicas in this example, spread over different brokers

    Cost vs. Availability
  • F replicas can tolerate F-1 failures
  • Storage vs. Time Travel
  • Storage vs. Time Travel
  • Latency vs. Throughput

    My experiments showed a 4x compression ratio with lz4, even with Avro data
  • Latency vs. Throughput

    Compression works on compacted topics now too
  • Latency vs. Durability
  • Latency vs. Durability
  • Before we get to the next knob, we need to review the idea of In-Sync Replicas.
  • Availability vs. Durability
  • Availability vs. Durability
  • Availability vs. Durability

    This should be very rare, but in a severe outage situation, some applications prefer automatic recovery even if data is lost
  • Availability vs. Durability
  • This is a good place to start and adjust down if you find you need to optimize further
    With batching and compression, you should be able to get very good throughput and safety
  • Now let’s assume that you’ve got a cluster up and running. You’ve set up good component-level monitoring so you can tell that ZooKeeper and Kafka are healthy. You’ve tuned it for your application priorities. Kafka handles failures very smoothly and requires little attention. I’ve run it in production at a previous job and had a broker die without any interruption to the application (handling > 4B events/day).

    However, there is some maintenance that you have to do and it’s around data balancing so I think it’s important to understand this and plan for it.
  • In the example, broker 2 is under-utilized and broker 5 is not being utilized at all

  • We’ve heard from many customers that this is a pain point

    - It took us 2 weeks
  • We’ve talked about running Kafka reliably in a single data center. Another important consideration for putting Kafka in production is how to handle multiple datacenters. The topic is too deep to cover in detail, so the goal of this section is to give an introduction and motivate you to watch an excellent talk on this subject by Confluent Co-Founder Jun Rao.

  • Simplest setup for failure handling but does not work across regions
  • There are a number of variations to this and some of them require manual intervention on failure recovery. Details in Jun’s talk. Please watch it. Over time, you’ll probably need to support multiple replication patterns to match different use cases.

    This picture shows an example of 1) aggregating data from other DCs for analytics and 2) cross-replicating between DCs so they can both see each other's data.
  • The goal of this section is to highlight the tradeoffs you can make to align Kafka with your application's priorities
  • Streaming in Practice - Putting Apache Kafka in Production

    1. Streaming in Practice: Putting Apache Kafka in Production – Roger Hoover, Engineer, Confluent
    2. Apache Kafka: Online Talk Series (https://www.confluent.io/apache-kafka-talk-series/) – Part 1: September 27, Part 2: October 6, Part 3: October 27, Part 4: November 17, Part 5: December 1, Part 6: December 15. Talks in the series: Introduction To Streaming Data and Stream Processing with Apache Kafka; Deep Dive into Apache Kafka; Demystifying Stream Processing with Apache Kafka; Data Integration with Apache Kafka; A Practical Guide to Selecting a Stream Processing Technology; and this talk, Streaming in Practice: Putting Apache Kafka in Production.
    3. Agenda • Kafka Basics • Tuning Kafka For Your Application • Data Balancing • Spanning Multiple Datacenters
    4. Agenda • Kafka Basics • Tuning Kafka For Your Application • Data Balancing • Spanning Multiple Datacenters
    5. [no slide text]
    6. Architecture [diagram: producers and consumers connected to a Kafka cluster of brokers, with topics split into partitions across servers, plus a ZooKeeper cluster]
    7. Operations • Simple Deployment • Rolling Upgrades • Good metrics for component monitoring
    8. Agenda • Kafka Basics • Tuning Kafka For Your Application • Data Balancing • Spanning Multiple Datacenters
    9. Two Example Apps • User activity tracking • Collect page view events while users are browsing our web and mobile storefronts • Persist the data to HDFS for subsequent use in recommendation engine • Inventory adjustments • Track sales, maintain inventory, and re-order on demand
    10. Application Priorities • User activity tracking • High throughput (100x the sales stream) • Availability is most important • Low retention required – 3 days • Inventory adjustments • Relatively low throughput • Durability is most important • Long retention required – 6 months
    11. Knobs - Partition count - Replication factor - Retention - Batching + compression - Producer send acknowledgements - Minimum ISRs - Unclean Leader Election
    12. Partition Count - Partitions are the unit of consumer parallelism - Over-partition your topics (especially keyed topics) - Easy to add consumers but hard to add partitions for keyed topics - Kafka can support tens of thousands of partitions
    13. Partition Count - High Throughput (User activity tracking): large number of partitions (~100) - Fewer Resources (Inventory adjustments): smaller number of partitions (< 50)
    14. Replication Factor - More replicas require more storage, disk I/O, and network bandwidth - More replicas can tolerate more failures [diagram: topic1 and topic2 partitions replicated across brokers 1-4]
    15. Replication Factor - Lower cost (User activity tracking): replication.factor = 2 - High Fault Tolerance (Inventory adjustments): replication.factor = 3 - Defaults to 1
    16. Retention - Retention time can be set per topic - Longer retention times require more storage (imagine that!) - Longer retention allows consumers to rewind further back in time - Part of the consumer’s SLA!
    17. Retention - Less Storage (User activity tracking): log.retention.hours=72 (3 days) - Longer Time Travel (Inventory adjustments): log.retention.hours=4380 (6 months) - Default is 7 days
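    To make the topic-level knobs above (partition count, replication factor, retention) concrete, here is a sketch of creating the two example topics with the Java AdminClient. This is an illustration under assumptions: the AdminClient API arrived in Kafka 0.11, after this talk, so at the time you would use the kafka-topics.sh tool instead; the topic names, broker address, and sizing are hypothetical values for the example apps.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateExampleTopics {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
          // User activity tracking: many partitions for throughput, replication factor 2, 3-day retention.
          NewTopic pageViews = new NewTopic("page-views", 100, (short) 2)
              .configs(Collections.singletonMap("retention.ms", String.valueOf(72L * 60 * 60 * 1000)));

          // Inventory adjustments: fewer partitions, replication factor 3, ~6-month retention.
          NewTopic inventory = new NewTopic("inventory-adjustments", 24, (short) 3)
              .configs(Collections.singletonMap("retention.ms", String.valueOf(4380L * 60 * 60 * 1000)));

          admin.createTopics(Arrays.asList(pageViews, inventory)).all().get();
        }
      }
    }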
    18. Side-note: Time Travel - Kafka 0.10.1 supports rewinding by time - E.g. “Rewind to 10 minutes ago”
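    A sketch of the “rewind to 10 minutes ago” pattern using the consumer offsetsForTimes API added in Kafka 0.10.1; the broker address, group id, and topic name are placeholders based on the inventory example.

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class RewindTenMinutes {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("group.id", "inventory-replay");        // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          // Manually assign all partitions of the topic so we can seek them explicitly.
          List<TopicPartition> partitions = new ArrayList<>();
          for (PartitionInfo info : consumer.partitionsFor("inventory-adjustments")) {
            partitions.add(new TopicPartition(info.topic(), info.partition()));
          }
          consumer.assign(partitions);

          // Ask the brokers for the earliest offset whose timestamp is >= 10 minutes ago.
          long tenMinutesAgo = System.currentTimeMillis() - 10 * 60 * 1000L;
          Map<TopicPartition, Long> query = new HashMap<>();
          for (TopicPartition tp : partitions) {
            query.put(tp, tenMinutesAgo);
          }
          for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
            if (e.getValue() != null) {               // null if no message at or after that time
              consumer.seek(e.getKey(), e.getValue().offset());
            }
          }
          // consumer.poll(...) now starts from roughly 10 minutes ago.
        }
      }
    }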
    19. Batching & Compression - Producer: batch.size, linger.ms, compression.type - Consumer: fetch.min.bytes, fetch.max.wait.ms [diagram: the producer accumulates send() calls into compressed batches and flushes them asynchronously; the broker stores the compressed batches; the consumer fetches and decompresses them on poll()]
    20. Batching & Compression - High throughput (User activity tracking) - Producer: compression.type=lz4, batch.size (256KB), linger.ms (~10ms) or flush manually - Consumer: fetch.min.bytes (256KB), fetch.max.wait.ms (~10ms) - Low latency (Inventory adjustments) - Producer: linger.ms=0 - Consumer: fetch.min.bytes=1 - Defaults: compression.type = none, linger.ms = 0 (i.e. send immediately), fetch.min.bytes = 1 (i.e. receive immediately)
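    A sketch of how these batching and compression settings map onto the Java clients for the two example apps; the values mirror the slide, and the usual bootstrap.servers and (de)serializer settings are omitted for brevity.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import java.util.Properties;

    public class BatchingAndCompressionConfigs {

      // High throughput (user activity tracking): batch aggressively and compress.
      public static Properties highThroughputProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);   // ~256 KB batches per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);        // wait up to ~10 ms to fill a batch
        return props;
      }

      public static Properties highThroughputConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 262144);  // wait for ~256 KB of data...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 10);    // ...but no longer than ~10 ms
        return props;
      }

      // Low latency (inventory adjustments): these are the defaults, shown for contrast.
      public static Properties lowLatencyProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);          // send immediately
        return props;
      }

      public static Properties lowLatencyConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);    // deliver as soon as any data is available
        return props;
      }
    }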
    21. Producer Acknowledgements on Send [diagram: the producer writes to the partition leader on broker 1; followers on brokers 2 and 3 replicate the write before it is committed] - Latency vs. durability on failures: acks=0 (no ack): no network delay, some data loss - acks=1 (wait for leader): 1 network round trip, a little data loss - acks=all (wait for committed): 2 network round trips, no data loss
    22. Producer Acknowledgements on Send - Throughput++ (User activity tracking): acks = 1 - Durability++ (Inventory adjustments): acks = all - Default: acks = 1
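    A sketch of how the acks choice maps onto producer configuration for the two example apps; only the acks-related lines are shown (bootstrap servers and serializers omitted).

    import org.apache.kafka.clients.producer.ProducerConfig;
    import java.util.Properties;

    public class AcksConfigs {

      // Throughput++ (user activity tracking): only the partition leader must acknowledge.
      public static Properties activityTrackingProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "1");    // one network round trip, small risk of loss on leader failure
        return props;
      }

      // Durability++ (inventory adjustments): wait until the write is committed to all in-sync replicas.
      public static Properties inventoryProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "all");  // two round trips, no loss as long as an in-sync replica survives
        return props;
      }
    }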
    23. In-Sync Replicas (ISRs) - In-sync: a replica is reading from the leader’s log end within replica.lag.time.max.ms [diagram: the leader on broker 1 and followers on brokers 2 and 3 each hold messages m1 and m2; the ISR tracks which replicas are caught up to the last committed message]
    24. Minimum In-Sync Replicas - Topic config to tell Kafka how to handle writes during severe outages (rare) - Leader will reject writes if the ISR count is too small - Example: topic1: min.insync.replicas=2 [diagram: a follower has fallen behind and dropped out of the ISR]
    25. Minimum In-Sync Replicas - Availability++ (User activity tracking): min.insync.replicas = 1 - Durability++ (Inventory adjustments): min.insync.replicas = 2 - Defaults to 1
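    When the ISR shrinks below min.insync.replicas, the leader rejects writes from producers using acks=all rather than accepting them silently. Here is a sketch of what that looks like on the client side, assuming the inventory topic from the example; the broker address, topic name, and key/value are placeholders.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.NotEnoughReplicasException;
    import java.util.Properties;

    public class InventoryWriter {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          producer.send(new ProducerRecord<>("inventory-adjustments", "sku-123", "-2"),
              (metadata, exception) -> {
                if (exception instanceof NotEnoughReplicasException) {
                  // The ISR has shrunk below min.insync.replicas, so the leader rejected the write.
                  // Surface the failure (retry later, alert) instead of losing the adjustment.
                  System.err.println("Write rejected, not enough in-sync replicas: " + exception.getMessage());
                } else if (exception != null) {
                  System.err.println("Send failed: " + exception);
                }
              });
        }
      }
    }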
    26. Unclean Leader Election - Topic config to tell Kafka how to handle topic leadership during severe outages (rare) - Allows automatic recovery in exchange for losing data [diagram: the leader on broker 1 fails; a follower on broker 2 that is missing the most recent messages becomes leader, and those messages are lost]
    27. Unclean Leader Election - Availability++ (User activity tracking): unclean.leader.election.enable = true - Durability++ (Inventory adjustments): unclean.leader.election.enable = false - Defaults to true
    28. Mission Critical Data - Producer acknowledgments: acks=all - Replication factor: replication.factor = 3 - Minimum ISRs: min.insync.replicas = 2 - Unclean Leader Election: unclean.leader.election.enable = false
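    Pulling the durability settings together, here is a sketch of a producer configuration for the mission-critical inventory stream. The topic-side settings are shown as comments because they are set on the topic or broker, not the client; the retries and max-in-flight lines are my additions for ordered, durable delivery, not something the slide prescribes, and the broker list is a placeholder.

    import org.apache.kafka.clients.producer.ProducerConfig;
    import java.util.Properties;

    public class MissionCriticalProducerConfig {
      public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all");                        // wait until the write is committed to all ISRs
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);         // keep retrying transient failures (my addition)
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);  // preserve ordering across retries (my addition)

        // Topic-side settings (set with kafka-topics.sh / kafka-configs.sh, not on the client):
        //   replication.factor = 3
        //   min.insync.replicas = 2
        //   unclean.leader.election.enable = false
        return props;
      }
    }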
    29. Agenda • Kafka Basics • Tuning Kafka For Your Application • Data Balancing • Spanning Multiple Datacenters
    30. Replica Placement • Partitions are replicated • Replicas are spread evenly across the cluster • Only when the topic is created or modified [diagram: topic1 and topic2 replicas distributed evenly across brokers 1-4]
    31. Replica Placement • Over time, broker load and storage become unbalanced • Initial replica placement does not account for topic throughput or retention • Adding or removing brokers [diagram: brokers 1-4 hold all the replicas while a newly added broker 5 holds none]
    32. Replica Reassignment • Create plan to rebalance replicas • Upload new assignment to the cluster • Kafka migrates replicas without disruption [before/after diagram: replicas are redistributed so broker 5 takes over some partitions]
    33. Data Balancing: Tricky Parts • Creating a good plan • Balance broker disk space • Balance broker load • Minimize data movement • Preserve rack placement • Movement of replicas can overload I/O and bandwidth resources • Use replication quota feature in 0.10.1
    34. Data Balancing: Solutions • DIY • kafka-reassign-partitions.sh script in Apache Kafka • Confluent Enterprise Auto Data Balancing • Optimizes storage utilization • Rack awareness and minimal data movement • Leverages replication quotas during rebalance
    35. Agenda • Kafka Basics • Tuning Kafka For Your Application • Data Balancing • Spanning Multiple Datacenters
    36. Use cases • Disaster Recovery • Replicate data out to geo-localized data centers • Aggregate data from other data centers for analysis • Part of hybrid cloud or cloud migration strategy
    37. Multi-DC: Two Approaches • Stretched cluster • Mirroring across clusters
    38. Stretched Cluster • Low-latency links between 3 DCs. Typically AZs in a single AWS region. • Applications in all 3 DCs share the same cluster and handle failures automatically. • Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3) • Use rack awareness in Kafka 0.10; manual partition placement otherwise [diagram: producers and consumers in three availability zones of one AWS region sharing a single Kafka cluster]
    39. Mirroring Across Clusters • Separate Kafka clusters in each DC. Mirroring process copies data between them. • Several variations of this pattern. Some require manual intervention on failover and recovery.
    40. How to Mirror Across Clusters • MirrorMaker tool in Apache Kafka • Manual topic creation • Manual sync of topic configuration • Confluent Enterprise Multi-DC • Dynamic topic creation at the destination • Automatic sync for topic configurations (including access controls) • Can be configured and managed from the Control Center UI • Leverages Connect API
    41. More Information: Tuning Tradeoffs • Apache Kafka and Confluent documentation • “When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka” – Gwen Shapira and Jeff Holoman – https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/ • Chapter 6: Reliability Guarantees – Neha Narkhede, Gwen Shapira, and Todd Palino, Kafka: The Definitive Guide • Confluent Operations Training
    42. More Information: Multi-DC • “Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka” – Jun Rao • Video: https://www.youtube.com/watch?v=XcvHmqmh16g • Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka • Confluent Enterprise Multi-DC – https://www.confluent.io/product/multi-datacenter/
    43. More Information: Metadata Management • “Yes, Virginia, You Really Do Need a Schema Registry” – Gwen Shapira – https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
    44. Thank you! www.kafka-summit.org • May 8, 2017 – New York City Hilton Midtown • August 28, 2017 – San Francisco Hilton Union Square
