Streaming in Practice - Putting Apache Kafka in Production



This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production


Slide 1: Streaming in Practice: Putting Apache Kafka in Production
Roger Hoover, Engineer, Confluent

Slide 2: Apache Kafka: Online Talk Series
• Part 1: September 27 • Part 2: October 6 • Part 3: October 27 • Part 4: November 17 • Part 5: December 1 • Part 6: December 15
• Talks in the series: Introduction To Streaming Data and Stream Processing with Apache Kafka; Deep Dive into Apache Kafka; Demystifying Stream Processing with Apache Kafka; Data Integration with Apache Kafka; A Practical Guide to Selecting a Stream Processing Technology; and this talk
• https://www.confluent.io/apache-kafka-talk-series/

Slide 3: Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Slide 4: Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
Slide 5: (image-only slide)

Slide 6: Architecture
(Diagram: producers send to a Kafka cluster of brokers 1 through n, each hosting topic partitions; consumers read from the brokers; a three-server ZooKeeper cluster coordinates the brokers.)

Slide 7: Operations
• Simple deployment
• Rolling upgrades
• Good metrics for component monitoring

Slide 8: Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
Slide 9: Two Example Apps
• User activity tracking
  - Collect page view events while users are browsing our web and mobile storefronts
  - Persist the data to HDFS for subsequent use in a recommendation engine
• Inventory adjustments
  - Track sales, maintain inventory, and re-order on demand

Slide 10: Application Priorities
• User activity tracking
  - High throughput (100x the sales stream)
  - Availability is most important
  - Low retention required: 3 days
• Inventory adjustments
  - Relatively low throughput
  - Durability is most important
  - Long retention required: 6 months

Slide 11: Knobs
- Partition count
- Replication factor
- Retention
- Batching + compression
- Producer send acknowledgements
- Minimum ISRs
- Unclean leader election
Slide 12: Partition Count
- Partitions are the unit of consumer parallelism
- Over-partition your topics (especially keyed topics): it is easy to add consumers later, but hard to add partitions for keyed topics
- Kafka can support on the order of tens of thousands of partitions

Slide 13: Partition Count
- High throughput (user activity tracking): large number of partitions (~100)
- Fewer resources (inventory adjustments): smaller number of partitions (< 50)
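A common sizing heuristic (not from this deck, but widely used with Kafka) is to pick enough partitions to reach the target throughput on both the produce and consume side; the function name and example numbers below are illustrative:

```python
import math

def suggested_partition_count(target_mb_s, per_partition_producer_mb_s, per_consumer_mb_s):
    """Enough partitions to hit the target throughput on both the
    produce side and the consume side (round each ratio up)."""
    return max(
        math.ceil(target_mb_s / per_partition_producer_mb_s),
        math.ceil(target_mb_s / per_consumer_mb_s),
    )

# e.g. 200 MB/s target, 10 MB/s produce per partition, 2 MB/s per consumer
print(suggested_partition_count(200, 10, 2))  # 100
```

The consume side usually dominates, which is why the high-throughput tracking topic lands near ~100 partitions.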
Slide 14: Replication Factor
- More replicas require more storage, disk I/O, and network bandwidth
- More replicas can tolerate more failures
(Diagram: partitions topic1-part1, topic1-part2, topic2-part1, and topic2-part2, each replicated across brokers 1 through 4.)

Slide 15: Replication Factor
- Lower cost (user activity tracking): replication.factor = 2
- High fault tolerance (inventory adjustments): replication.factor = 3
- Default: replication.factor = 1

Slide 16: Retention
- Retention time can be set per topic
- Longer retention times require more storage (imagine that!)
- Longer retention allows consumers to rewind further back in time
- Part of the consumer's SLA!

Slide 17: Retention
- Less storage (user activity tracking): log.retention.hours=72 (3 days)
- Longer time travel (inventory adjustments): log.retention.hours=4380 (6 months)
- Default is 7 days
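The retention values are simple hours arithmetic; a quick sketch (plain Python, function name is mine) to sanity-check them:

```python
HOURS_PER_DAY = 24

def retention_hours(days):
    """Convert a retention window in days to a log.retention.hours value."""
    return days * HOURS_PER_DAY

print(retention_hours(3))      # 72 -> user activity tracking (3 days)
print(retention_hours(182.5))  # 4380.0 -> inventory adjustments (~6 months, half of 365 days)
```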
Slide 18: Side-note: Time Travel
- Kafka 0.10.1 supports rewinding by time
- E.g. "Rewind to 10 minutes ago"

Slide 19: Batching & Compression
- Producer: batch.size, linger.ms, compression.type
- Consumer: fetch.min.bytes, fetch.wait.max.ms
(Diagram: producer send() calls accumulate into compressed batches that are flushed asynchronously to the broker; the consumer fetches whole compressed batches via poll().)

Slide 20: Batching & Compression
- High throughput (user activity tracking)
  - Producer: compression.type=lz4, batch.size (256KB), linger.ms (~10ms), or flush manually
  - Consumer: fetch.min.bytes (256KB), fetch.wait.max.ms (~10ms)
- Low latency (inventory adjustments)
  - Producer: linger.ms=0
  - Consumer: fetch.min.bytes=1
- Defaults
  - compression.type = none
  - linger.ms = 0 (i.e. send immediately)
  - fetch.min.bytes = 1 (i.e. receive immediately)
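One way to capture the two profiles is as plain config maps; this is a sketch where the keys are standard Kafka client property names and the dict structure is mine:

```python
# High-throughput producer profile (user activity tracking)
high_throughput_producer = {
    "compression.type": "lz4",
    "batch.size": 256 * 1024,  # 256 KB batches
    "linger.ms": 10,           # wait up to ~10 ms to fill a batch
}

# Matching high-throughput consumer profile
high_throughput_consumer = {
    "fetch.min.bytes": 256 * 1024,
    "fetch.max.wait.ms": 10,   # the Java consumer's name for the fetch-wait setting
}

# Low-latency profiles (inventory adjustments): send and receive immediately
low_latency_producer = {"linger.ms": 0}
low_latency_consumer = {"fetch.min.bytes": 1}

print(high_throughput_producer["batch.size"])  # 262144
```

The throughput profile trades a small, bounded delay (linger.ms) for much larger, better-compressed batches; the latency profile is simply the defaults.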
Slide 21: Producer Acknowledgements on Send
(Diagram: producer sends to the leader on broker 1; followers on brokers 2 and 3 fetch and replicate; the leader commits the message and sends the ack.)

When producer receives ack    | Latency              | Durability on failures
acks=0 (no ack)               | no network delay     | some data loss
acks=1 (wait for leader)      | 1 network roundtrip  | a little data loss
acks=all (wait for committed) | 2 network roundtrips | no data loss

Slide 22: Producer Acknowledgements on Send
- Throughput++ (user activity tracking): acks = 1
- Durability++ (inventory adjustments): acks = all
- Default: acks = 1

Slide 23: In-Sync Replicas (ISRs)
- In-sync: a replica reads from the leader's log end within replica.lag.time.max.ms
(Diagram: leader on broker 1 and followers on brokers 2 and 3 each hold messages m1 and m2; all three replicas are in the ISR and m2 is the last committed message.)

Slide 24: Minimum In-Sync Replicas
- Topic config to tell Kafka how to handle writes during severe outages (rare)
- The leader will reject writes if the ISR count is too small
- Example: topic1: min.insync.replicas=2
(Diagram: broker 3 has fallen out of the ISR; with min.insync.replicas=2, the leader and its one in-sync follower still accept writes.)

Slide 25: Minimum In-Sync Replicas
- Availability++ (user activity tracking): min.insync.replicas = 1
- Durability++ (inventory adjustments): min.insync.replicas = 2
- Defaults to 1
Slide 26: Unclean Leader Election
- Topic config to tell Kafka how to handle topic leadership during severe outages (rare)
- Allows automatic recovery in exchange for losing data
(Diagram: the leader on broker 1 fails; a lagging follower is elected leader, and the messages it never replicated are lost.)

Slide 27: Unclean Leader Election
- Availability++ (user activity tracking): unclean.leader.election.enable = true
- Durability++ (inventory adjustments): unclean.leader.election.enable = false
- Defaults to true

Slide 28: Mission Critical Data
- Producer acknowledgments: acks=all
- Replication factor: replication.factor = 3
- Minimum ISRs: min.insync.replicas = 2
- Unclean leader election: unclean.leader.election.enable = false
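The mission-critical recipe maps onto one topic-creation command plus one producer setting. A sketch using the stock Kafka tooling of this era (topic name, partition count, and ZooKeeper address are illustrative):

```shell
# Create a durable topic: 3 replicas, writes require 2 in-sync replicas,
# never elect an out-of-sync leader.
kafka-topics.sh --zookeeper localhost:2181 --create \
  --topic inventory-adjustments \
  --partitions 40 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false

# In the producer configuration, also set:
#   acks=all
```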
Slide 29: Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Slide 30: Replica Placement
• Partitions are replicated
• Replicas are spread evenly across the cluster
• Only when the topic is created or modified
(Diagram: replicas of topic1 and topic2 partitions spread evenly across brokers 1 through 4.)

Slide 31: Replica Placement
• Over time, broker load and storage become unbalanced
• Initial replica placement does not account for topic throughput or retention
• Adding or removing brokers
(Diagram: a newly added broker 5 holds no replicas until partitions are reassigned.)

Slide 32: Replica Reassignment
• Create a plan to rebalance replicas
• Upload the new assignment to the cluster
• Kafka migrates replicas without disruption
(Diagram: before/after views of the cluster; after reassignment, some replicas have moved onto the previously empty broker 5.)

Slide 33: Data Balancing: Tricky Parts
• Creating a good plan
  - Balance broker disk space
  - Balance broker load
  - Minimize data movement
  - Preserve rack placement
• Movement of replicas can overload I/O and bandwidth resources
  - Use the replication quota feature in 0.10.1

Slide 34: Data Balancing: Solutions
• DIY
  - kafka-reassign-partitions.sh script in Apache Kafka
• Confluent Enterprise Auto Data Balancing
  - Optimizes storage utilization
  - Rack awareness and minimal data movement
  - Leverages replication quotas during rebalance
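For the DIY route, kafka-reassign-partitions.sh takes a JSON assignment file. A minimal hand-written plan might look like this (topic name and broker IDs are illustrative; here two partitions move onto the new broker 5):

```json
{
  "version": 1,
  "partitions": [
    { "topic": "topic2", "partition": 0, "replicas": [5, 2] },
    { "topic": "topic2", "partition": 1, "replicas": [5, 3] }
  ]
}
```

The tool can also --generate a candidate plan from a list of target brokers, apply a plan with --execute --reassignment-json-file, and check completion with --verify.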
Slide 35: Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Slide 36: Use cases
• Disaster recovery
• Replicate data out to geo-localized data centers
• Aggregate data from other data centers for analysis
• Part of a hybrid cloud or cloud migration strategy

Slide 37: Multi-DC: Two Approaches
• Stretched cluster
• Mirroring across clusters

Slide 38: Stretched Cluster
• Low-latency links between 3 DCs; typically AZs in a single AWS region
• Applications in all 3 DCs share the same cluster and handle failures automatically
• Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3)
• Use rack awareness in Kafka 0.10; manual partition placement otherwise
(Diagram: producers and consumers in each of three AZs within one AWS region, all sharing a single Kafka cluster.)
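Rack awareness for a stretched cluster is configured per broker via broker.rack in server.properties; the broker ID and AZ name below are illustrative:

```properties
# server.properties on each broker: tag the broker with its AZ so
# rack-aware replica assignment spreads replicas across zones
broker.id=1
broker.rack=us-east-1a
```

With every broker tagged and replication.factor >= 3, each partition ends up with a replica in each zone, so losing one AZ loses at most one replica per partition.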
Slide 39: Mirroring Across Clusters
• Separate Kafka clusters in each DC; a mirroring process copies data between them
• Several variations of this pattern; some require manual intervention on failover and recovery

Slide 40: How to Mirror Across Clusters
• MirrorMaker tool in Apache Kafka
  - Manual topic creation
  - Manual sync of topic configuration
• Confluent Enterprise Multi-DC
  - Dynamic topic creation at the destination
  - Automatic sync for topic configurations (including access controls)
  - Can be configured and managed from the Control Center UI
  - Leverages the Connect API
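A minimal MirrorMaker invocation, as shipped with Apache Kafka in this era, looks like the following sketch; the .properties file names and topic regex are illustrative, and each file needs at least bootstrap.servers (plus group.id on the consumer side):

```shell
# Consume from the source DC's cluster and re-produce into the destination DC.
kafka-mirror-maker.sh \
  --consumer.config source-cluster-consumer.properties \
  --producer.config dest-cluster-producer.properties \
  --whitelist 'inventory.*'
```

Run the MirrorMaker process close to the destination cluster, so the long-haul hop is the (retriable) consumer fetch rather than the produce.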
Slide 41: More Information: Tuning Tradeoffs
• Apache Kafka and Confluent documentation
• "When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka", Gwen Shapira and Jeff Holoman: https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/
• Chapter 6, "Reliability Guarantees", in Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, and Todd Palino
• Confluent Operations Training

Slide 42: More Information: Multi-DC
• "Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka", Jun Rao
  - Video: https://www.youtube.com/watch?v=XcvHmqmh16g
  - Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka
• Confluent Enterprise Multi-DC: https://www.confluent.io/product/multi-datacenter/

Slide 43: More Information: Metadata Management
• "Yes, Virginia, You Really Do Need a Schema Registry", Gwen Shapira: https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/

Slide 44: Thank you!
www.kafka-summit.org
• May 8, 2017: New York City, Hilton Midtown
• August 28, 2017: San Francisco, Hilton Union Square
