Apache Kafka lessons learned @PAYBACK

We share our experience with Apache Kafka for event sourcing in a microservices-based architecture. The talk was part of a Meetup: https://www.meetup.com/de-DE/Apache-Kafka-Germany-Munich/events/236402498/

Published in: Technology

  1. Kafka lessons learned @PAYBACK. Munich, 2017
  2. https://quotefancy.com/
  3. PAYBACK Global – One Global Platform for 3 markets…
     • Monolithic CORE, 3-tier JEE + configuration
     • > 100 million customers …
     • Sprint cycle: Product Backlog, Sprint Backlog, Sprint (14 days, 24 h), Runnable Software
     • Release stages: deploy, SIT, UAT, Partner test, Staging / NFR Test, Transition, Go-Live, Monitoring
     • > 30 environments, > 200 servers, > 100 artefacts
     • Monthly major release
     Apache Kafka Lessons Learned @ PAYBACK
  4. Architecture Blueprint
  5. Our Use Case: sorry, it's not big data (yet)
  6. Orchestration vs. Choreography – Business Process (Sam Newman 2015, Building Microservices, O'Reilly)
  7. Orchestration: Synchronous (Sam Newman 2015, Building Microservices, O'Reilly)
     PRO
     • Easy to map code to business process
     • Immediate feedback about every stage
     • Atomic execution
     CON
     • Customer Service becomes the central place of logic
     • Leads to "God" services
     • Tight coupling, high cost of changes
     • Resilience is complex (think retries, scaling…)
  8. Choreography: Asynchronous, Event Sourcing (Sam Newman 2015, Building Microservices, O'Reilly)
     PRO
     • Easier to achieve resilience and performance
     • More decoupled
     • Distributed logic
     • Higher flexibility (changes, scaling)
     CON
     • Higher implementation effort & complexity
     • Additional work for monitoring and tracking
     • Additional SPOF
  9. Resilience concerns the whole system
     Resilience is about the ability to fully recover from failure - to self-heal.
     Loose coupling helps implement resilience patterns, but you need to care about:
     ○ delivery and processing semantics
     ○ retries and fallback strategy
     ○ handling timeouts and other communication errors
     ○ transaction handling
     ○ no silver bullet pattern for all event types
  10. Choosing the right tool
      We need to consider:
      ○ NFRs may be specific to the event type
      ○ Delivery semantics depend on the event type
        - at most once
        - at least once
        - exactly once
      ○ Event order can be important for some use cases (FIFO)
      ○ Reprocessing must be possible
      ○ Monitoring and alerting must be well supported (APIs) due to the increased complexity
      ○ …
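The difference between "at most once" and "at least once" largely comes down to whether the offset is committed before or after an event is processed. The following is a minimal, broker-free sketch of that idea; the `consume` function, its `crash_at` parameter and the in-memory `log` are hypothetical names used only to model the behaviour, and no Kafka client is involved.

```python
def consume(log, commit_first, crash_at):
    """Model a consumer that crashes once, between commit and processing,
    while handling offset `crash_at`, and then restarts from the last
    committed offset. Returns the events whose processing completed."""
    committed = 0      # offset the consumer resumes from after a restart
    processed = []

    def run(start, crash_offset):
        nonlocal committed
        for offset in range(start, len(log)):
            # Step 1: either commit or process, depending on the strategy.
            if commit_first:
                committed = offset + 1          # at most once
            else:
                processed.append(log[offset])   # at least once
            if offset == crash_offset:
                return False                    # crash between the two steps
            # Step 2: the other half of the pair.
            if commit_first:
                processed.append(log[offset])
            else:
                committed = offset + 1
        return True

    if not run(0, crash_at):
        run(committed, None)                    # restart after the crash
    return processed


events = ["a", "b", "c"]
print(consume(events, commit_first=True, crash_at=1))   # "b" is lost
print(consume(events, commit_first=False, crash_at=1))  # "b" is processed twice
```

Committing first loses the event that was in flight during the crash; processing first replays it, which is why at-least-once consumers need to tolerate duplicates.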
  11. pub&sub, high throughput, low latency, scalable, centralized, real-time
  12. INCIDENTS: I have a joke about an event… …But you might not get it
  13. Cluster outage
      - VMs stalled during snapshot backups, leading to cluster reconnects
      - in 9/10 cases recovery worked
      - in 1/10 cases this led to a single broker outside the cluster which still had partitions assigned (luckily it refused writes because of missing replicas)
      Deactivate backups! Consider physical machines!
  14. "A first sign of the beginning of understanding is the wish to die." Franz Kafka
      Image: by Atelier Jacobi: Sigismund Jacobi (1860–1935) - http://www.bodleian.ox.ac.uk/news/2008_july_02, public domain, https://commons.wikimedia.org/w/index.php?curid=5428566
  15. Configuration and implementation are complex
  16. Producer
      Implementation
      ○ Producer uses a non-blocking async API
      ○ Two options for checking for failures:
        - Immediately block for the response: send().get()
        - Do follow-up work in a Callback
        - Be careful about handling failures
      ○ Don't forget to close the producer! producer.close() will block until in-flight transactions complete
      Configuration
      ○ acks – set to all
      ○ batch.size – set to 0
      ○ retries (defaults to 0)
        - think about increasing this value
        - Not all errors are automatically retriable. Think about custom error handling on the producer side!
        - retry may affect message ordering
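Why a retry may affect message ordering can be illustrated with a deliberately simplified model: if several requests are in flight and the first one fails and is re-sent, a later one may already have been written. The sketch below is not the real producer; `deliver`, `fails_once` and `max_in_flight` are hypothetical names for this illustration only.

```python
def deliver(batches, fails_once, max_in_flight):
    """Return the order in which batches land in the partition log.

    `fails_once` is the set of batches whose first send attempt fails and
    is retried. Up to `max_in_flight` batches are sent without waiting for
    the previous one to be acknowledged."""
    log = []
    pending = list(batches)
    already_failed = set()
    while pending:
        window = pending[:max_in_flight]
        retry = []
        for batch in window:
            if batch in fails_once and batch not in already_failed:
                already_failed.add(batch)
                retry.append(batch)      # re-sent in the next round
            else:
                log.append(batch)        # written to the log
        pending = retry + pending[len(window):]
    return log


print(deliver(["b1", "b2", "b3"], fails_once={"b1"}, max_in_flight=2))
print(deliver(["b1", "b2", "b3"], fails_once={"b1"}, max_in_flight=1))
```

With two requests in flight, `b2` overtakes the retried `b1`; with a single in-flight request the original order is preserved, at the cost of throughput.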
  17. Consumer: Recommendations
      ○ Note: the consumer is single threaded – one consumer per thread
      ○ disable auto commit (enable.auto.commit = false)
      ○ commit explicit offsets using OffsetAndMetadata instead of committing everything
      ○ rollback with seek -> you need to know your last committed message -> implement a Rebalance Listener
      ○ rollback (seek) after errors in offset commit
      ○ change the default max.partition.fetch.bytes (1MB can lead to session timeout in < 0.10.X)
      ○ event processing should be idempotent – be prepared to handle duplicates
      ○ think about event reprocessing (how to change the offset, how to recreate an event etc.)
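The consumer recommendations (commit after processing, roll back with seek on errors, idempotent handling) can be sketched without a broker. `InMemoryConsumer` below is a hypothetical stand-in for a consumer on a single partition, and the event `id` field used for de-duplication is an assumption; a real implementation would use the Kafka client and a durable de-duplication store.

```python
class InMemoryConsumer:
    """Hypothetical single-partition consumer backed by a Python list."""

    def __init__(self, records):
        self.records = records
        self.position = 0    # next offset to fetch
        self.committed = 0   # position to resume from after a rollback

    def poll(self, max_records=2):
        indexed = list(enumerate(self.records))
        batch = indexed[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self, offset):
        self.committed = offset

    def seek(self, offset):
        self.position = offset


def process_all(consumer, handle):
    seen = set()  # ids of events that were already applied
    while True:
        batch = consumer.poll()
        if not batch:
            return
        for offset, event in batch:
            try:
                if event["id"] not in seen:     # idempotent: skip duplicates
                    handle(event)
                    seen.add(event["id"])
                consumer.commit(offset + 1)     # commit only what was processed
            except Exception:
                # Roll back to the last committed offset and fetch again.
                consumer.seek(consumer.committed)
                break
```

Note that this sketch retries a failing event indefinitely; a production consumer would also need a retry limit or a fallback for events that can never be processed.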
  18. Other basic configuration: Be Safe, Not Sorry
      ○ acks = all
      ○ block.on.buffer.full = true
      ○ producer retries = MAX_INT
      ○ ( max.in.flight.requests.per.connection = 1 )
      ○ producer.close()
      ○ replication factor >= 3
      ○ min.insync.replicas = 2
      ○ unclean.leader.election.enable = false
      ○ enable.auto.commit = false
      ○ Commit after processing
      ○ Monitor!
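A checklist like this is easy to drift away from, so it can help to verify configurations automatically. The following is a minimal sketch of such a check; the `EXPECTED` table, the scope names and the `violations` helper are hypothetical, and the values simply mirror the checklist above rather than any universal recommendation.

```python
# Expected values per configuration scope, taken from the checklist.
EXPECTED = {
    "producer": {
        "acks": "all",
        "max.in.flight.requests.per.connection": 1,
    },
    "topic": {
        "min.insync.replicas": 2,
        "unclean.leader.election.enable": False,
    },
    "consumer": {
        "enable.auto.commit": False,
    },
}


def violations(scope, config, replication_factor=None):
    """Return human-readable deviations of `config` from the checklist."""
    problems = []
    for key, expected in EXPECTED[scope].items():
        actual = config.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    if replication_factor is not None and replication_factor < 3:
        problems.append(f"replication factor: expected >= 3, got {replication_factor}")
    return problems


print(violations("producer", {"acks": "1"}))
print(violations("topic", {"min.insync.replicas": 2,
                           "unclean.leader.election.enable": False},
                 replication_factor=2))
```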
  19. Monitoring (http://www.spiegel.de/spiegel/print/d-129456859.html)
  20. Timeseries Metrics to Graphite
      [Diagram: three KafkaBroker instances and a KafkaConsumer, each with Metrics Library + Graphite Reporter, send metrics to Graphite, which is visualized in Grafana]
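The slides use a metrics library with a Graphite reporter on the JVM side. To show what ends up on the wire, here is a minimal sketch of Graphite's plaintext protocol, where each data point is one `path value timestamp` line. The metric prefix and names are hypothetical, and `send_to_graphite` assumes a Carbon plaintext listener on its usual port 2003.

```python
import socket
import time


def graphite_lines(prefix, metrics, timestamp=None):
    """Format metrics as Graphite plaintext lines: '<path> <value> <ts>\\n'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return [f"{prefix}.{name} {value} {ts}\n"
            for name, value in sorted(metrics.items())]


def send_to_graphite(lines, host, port=2003):
    """Send preformatted lines to a Carbon plaintext listener (assumed setup)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric names, for illustration only.
lines = graphite_lines("kafka.consumer.mdeAppGroup",
                       {"lag": 142, "records_per_sec": 35},
                       timestamp=1485253978)
print("".join(lines), end="")
```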
  21. Kafka-Manager: Open Source UI/API Kafka Mgmt Tool (https://github.com/yahoo/kafka-manager)
      • Good for current cluster status and ad-hoc analysis
      • Provides a status API (HTTP)
      • Consumers are only displayed during active consumption
      • 0.10.x support still not merged
  22. Kafka-Manager API Example
      curl -XGET http://kafka-manager/api/status/VP2/mdeAppGroup/groupSummary?consumerType=KF
      {memberDataChanges: {totalLag: 142, percentageCovered: 100, partitionOffsets: [1779279, 372957, 368100, 372415, 368349, 374649, 373262, 373934, 1775065, 373339, 369416, 374362, […]
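A response like this can feed a simple lag check. The sketch below assumes the endpoint returns valid JSON keyed by topic with the `totalLag` field shown on the slide; the `SAMPLE` body is a shortened, hand-written stand-in (the slide output is truncated), and `topics_over_threshold` is a hypothetical helper.

```python
import json

# Shortened stand-in for the response body; real output lists all partitions.
SAMPLE = """
{"memberDataChanges": {"totalLag": 142,
                       "percentageCovered": 100,
                       "partitionOffsets": [1779279, 372957, 368100]}}
"""


def topics_over_threshold(body, max_lag):
    """Return {topic: totalLag} for every topic whose lag exceeds `max_lag`."""
    summary = json.loads(body)
    return {topic: info["totalLag"]
            for topic, info in summary.items()
            if info["totalLag"] > max_lag}


print(topics_over_threshold(SAMPLE, max_lag=100))
```

Unlike Burrow, this approach needs a threshold per consumer group, which has to be tuned to the normal throughput of each topic.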
  23. Burrow: API only Consumer Lag Checking
      curl -XGET http://burrow/v2/kafka/vp2/consumer/mdeAppGroup/lag
      { error: false, message: "consumer group status returned",
        status: { cluster: "vp2", group: "mdeAppGroup", status: "ERR", complete: false,
          partitions: [
            { topic: "memberDataChanges", partition: 1, status: "STOP",
              start: { offset: 1775109, timestamp: 1485253978439, lag: 0 },
              end: { offset: 1775127, timestamp: 1485254054861, lag: 1 } },
            […]
          ],
          totallag: 8 },
        request: { url: "/v2/kafka/vp2/consumer/mdeAppGroup/lag", host: "hqiqlpxxap89", cluster: "vp2", group: "mdeAppGroup", topic: "" } }
      • No thresholds required
      • Alerting via email and HTTP POST
      • Issue: calculate lag at request time, not commit time
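Because Burrow reports a status per partition, an alerting hook only has to pick out the partitions that are not healthy. The sketch below assumes the response is valid JSON with the field names shown on the slide and that `"OK"` denotes a healthy status; the `SAMPLE` body is a shortened stand-in and `unhealthy_partitions` is a hypothetical helper.

```python
import json

# Shortened stand-in for the response body shown on the slide.
SAMPLE = """
{"error": false,
 "message": "consumer group status returned",
 "status": {"cluster": "vp2", "group": "mdeAppGroup", "status": "ERR",
            "complete": false,
            "partitions": [
              {"topic": "memberDataChanges", "partition": 1, "status": "STOP",
               "start": {"offset": 1775109, "timestamp": 1485253978439, "lag": 0},
               "end": {"offset": 1775127, "timestamp": 1485254054861, "lag": 1}}
            ],
            "totallag": 8}}
"""


def unhealthy_partitions(body):
    """Return (topic, partition, status, latest lag) for non-OK partitions."""
    response = json.loads(body)
    if response["error"]:
        raise ValueError(response["message"])
    return [(p["topic"], p["partition"], p["status"], p["end"]["lag"])
            for p in response["status"]["partitions"]
            if p["status"] != "OK"]


print(unhealthy_partitions(SAMPLE))
```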
  24. "God gives the nuts, but he does not crack them." Franz Kafka
      PAYBACK GmbH, Maxim Schelest, Thomas Falkenberg
      Theresienhöhe 12, 80339 München
      Phone +49 (0) 89 997 41 – 0
      PAYBACK.net | PAYBACK.de
