Apache Kafka lessons learned @PAYBACK
Munich, 2017
https://quotefancy.com/
3
PAYBACK Global – One Global Platform for 3 markets…
Apache Kafka Lessons Learned @ PAYBACK
Monolithic CORE
3 tier JEE + Configuration
> 100 million customers
…
[Figure: Scrum cycle with a 14-day sprint and a 24 h daily cycle: Product Backlog, Sprint Backlog, Sprint, Runnable Software]
[Figure: release pipeline: deploy, SIT, UAT, Partner test, Staging / NFR Test, Transition, Go-Live, Monitoring]
> 30 environments
> 200 servers
> 100 artefacts
Monthly Major Release
4
Architecture Blueprint
5
Our Use Case
sorry, it's not big data (yet)
6
Orchestration vs. Choreography – Business Process
Sam Newman 2015, Building Microservices, O'Reilly
7
Orchestration: Synchronous
Sam Newman 2015, Building Microservices, O'Reilly
PRO
• Easy to map code to business process
• Immediate feedback about every stage
• Atomic execution
CON
• Customer Service becomes the central place of logic
• Leads to "God" services
• Tight coupling, high cost of changes
• Resilience is complex (think retries, scaling…)
8
Choreography: Asynchronous, Event-driven
Sam Newman 2015, Building Microservices, O'Reilly
PRO
• Easier to achieve resilience and performance
• More decoupled
• Distributed logic
• Higher flexibility (changes, scaling)
CON
• Higher implementation effort & complexity
• Additional work for monitoring and tracking
• Additional SPOF
9
Resilience concerns the whole system
Resilience is the ability to fully recover from failure: to self-heal.
Loose coupling helps implement resilience patterns, but you need to care about:
○ delivery and processing semantics
○ retries and fallback strategy
○ handling of timeouts and other communication errors
○ transaction handling
○ no silver bullet pattern for all event types
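The "retries and fallback strategy" point can be illustrated with a minimal sketch. It is not taken from the PAYBACK code base: the class name, the exponential backoff policy and the parameters are assumptions chosen for illustration, and a real system would also need to decide which errors are worth retrying at all.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical helper, not part of any Kafka API.
final class RetryWithFallback {

    /**
     * Calls the action up to maxAttempts times, sleeping between attempts with
     * exponential backoff, and returns the fallback value if every attempt fails.
     */
    static <T> T call(Callable<T> action, int maxAttempts, long initialBackoffMs,
                      Supplier<T> fallback) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    break; // retries exhausted, fall through to the fallback
                }
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
                backoffMs *= 2; // assumed policy: double the wait after each failure
            }
        }
        return fallback.get();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical flaky call: fails twice, then succeeds.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("temporary failure");
            }
            return "ok";
        }, 5, 10, () -> "fallback");
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The fallback is deliberately a plain value supplier here; whether a fallback is acceptable at all depends on the event type, which is the "no silver bullet" point of this slide.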
10
Choosing the right tool
We need to consider:
○ NFRs may be specific to the event type
○ Delivery semantics depend on the event type
- at most once
- at least once
- exactly once
○ Event order can be important for some use cases (FIFO)
○ Reprocessing must be possible
○ Monitoring and alerting must be well supported (APIs) due to the increased complexity
○ …
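The difference between "at most once" and "at least once" comes down to whether the consumer commits its position before or after processing an event. The following is a small simulation under stated assumptions, not real Kafka client code: offsets are plain integers, the consumer crashes exactly once, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; no Kafka client involved.
final class DeliverySemanticsDemo {

    /**
     * Simulates a consumer reading offsets 0..eventCount-1 that crashes once,
     * between the two steps of handling the event at crashAtOffset, and then
     * restarts from the last committed position.
     *
     * commitBeforeProcessing = true  gives at-most-once  (the event can be lost)
     * commitBeforeProcessing = false gives at-least-once (the event can be duplicated)
     */
    static List<Integer> run(int eventCount, int crashAtOffset, boolean commitBeforeProcessing) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;      // next offset to read after a restart
        boolean crashed = false;
        int position = committed;
        while (position < eventCount) {
            boolean crashNow = !crashed && position == crashAtOffset;
            if (commitBeforeProcessing) {
                committed = position + 1;        // step 1: commit
                if (crashNow) {                  // crash before processing
                    crashed = true;
                    position = committed;        // restart from committed position
                    continue;
                }
                processed.add(position);         // step 2: process
            } else {
                processed.add(position);         // step 1: process
                if (crashNow) {                  // crash before committing
                    crashed = true;
                    position = committed;        // restart from committed position
                    continue;
                }
                committed = position + 1;        // step 2: commit
            }
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("at most once:  " + run(4, 2, true));
        System.out.println("at least once: " + run(4, 2, false));
    }
}
```

With four events and a crash at offset 2, the commit-first variant never processes offset 2, while the process-first variant processes it twice. Idempotent processing on top of at-least-once delivery is one common way to approximate exactly-once behaviour.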
11
Apache Kafka: pub/sub, high throughput, low latency, scalable, centralized, real-time
12
I have a joke about an event…
…But you might not get it
INCIDENTS
13
Cluster outage
- VMs stalled during snapshot backups, leading to cluster reconnects
- In 9 out of 10 cases recovery worked
- In 1 out of 10 cases this led to a single broker outside the cluster that still had partitions assigned (luckily it refused writes because of missing replicas)
Deactivate backups!
Consider physical machines!
14
"A first sign of the beginning of
understanding is the wish to die. "
Franz Kafka
By Atelier Jacobi: Sigismund Jacobi (1860–1935) - http://www.bodleian.ox.ac.uk/news/2008_july_02, public domain, https://commons.wikimedia.org/w/index.php?curid=5428566
15
Configuration and implementation are complex
16
Producer
Implementation
○ The producer uses a non-blocking async API
○ Two options for checking for failures:
- Immediately block for the response: send().get()
- Do follow-up work in a Callback
- Be careful about handling failures
○ Don’t forget to close the producer! producer.close() will block until in-flight requests complete
Configuration
○ acks – set to all
○ batch.size – set to 0
○ retries (defaults to 0) - think about increasing this value
- Not all errors are automatically retriable. Think about custom error handling on the producer side!
- Retries may affect message ordering
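The producer settings from this slide can be collected in a plain java.util.Properties object, which is what the Java client's KafkaProducer constructor accepts. This is a sketch only: the class and method names are hypothetical, the String serializers and the broker address are assumptions, and max.in.flight.requests.per.connection = 1 is included as one way to address the ordering concern raised above.

```java
import java.util.Properties;

// Hypothetical helper that only builds configuration; it does not connect to Kafka.
final class ProducerSettings {

    /** Builds producer properties following the recommendations on the slide. */
    static Properties durableProducerProps(String bootstrapServers, int retries) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Assumption: keys and values are plain strings.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("acks", "all");        // wait for all in-sync replicas
        props.setProperty("batch.size", "0");    // slide recommendation: no batching
        props.setProperty("retries", Integer.toString(retries));
        // With retries enabled, a single in-flight request keeps messages in order.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder address.
        Properties props = durableProducerProps("localhost:9092", 10);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

These properties trade throughput for safety; the retry count of 10 in the usage example is arbitrary and should be chosen per event type.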
17
Consumer
Recommendations
o Note: the consumer is single-threaded – one consumer per thread
o Disable auto commit (enable.auto.commit = false)
o Commit explicit offsets using OffsetAndMetadata instead of committing everything
o Roll back with seek -> you need to know your last committed message -> implement a Rebalance Listener
o Roll back (seek) after errors in offset commit
o Change the default max.partition.fetch.bytes (1 MB can lead to session timeouts in < 0.10.X)
o Event processing should be idempotent – be prepared to handle duplicates
o Think about event reprocessing (how to change the offset, how to recreate an event, etc.)
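Committing explicit offsets and rolling back with seek both require the application to track, per partition, what has been processed and what has been committed. The sketch below shows that bookkeeping in plain Java. It is an illustration with hypothetical names, not the PAYBACK implementation: in a real consumer the map would be converted to TopicPartition and OffsetAndMetadata entries for the commit call, and the rollback position would be passed to seek.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping class; partitions are identified by their number only.
final class OffsetTracker {
    // partition -> next offset to read, as last committed successfully
    private final Map<Integer, Long> committed = new HashMap<>();
    // partition -> next offset to read, processed but not yet committed
    private final Map<Integer, Long> pending = new HashMap<>();

    /** Records that the event at this offset was fully processed. */
    void markProcessed(int partition, long offset) {
        // The committed value is the offset of the NEXT event to read.
        pending.put(partition, offset + 1);
    }

    /** Returns a snapshot of the offsets to commit, only for partitions with progress. */
    Map<Integer, Long> offsetsToCommit() {
        return new HashMap<>(pending);
    }

    /** Call after the commit of the given snapshot succeeded. */
    void commitSucceeded(Map<Integer, Long> offsets) {
        committed.putAll(offsets);
        // Remove only entries that have not advanced since the snapshot was taken.
        offsets.forEach((partition, offset) -> pending.remove(partition, offset));
    }

    /**
     * Call after a processing or commit error: drops uncommitted progress and
     * returns the position to seek to (startOffset if nothing was committed yet).
     */
    long rollbackPosition(int partition, long startOffset) {
        pending.remove(partition);
        return committed.getOrDefault(partition, startOffset);
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        tracker.markProcessed(0, 41L);
        tracker.commitSucceeded(tracker.offsetsToCommit());
        tracker.markProcessed(0, 42L); // processed, but the commit will "fail"
        System.out.println("seek partition 0 to " + tracker.rollbackPosition(0, 0L));
    }
}
```

After a rollback the event at the returned position is read again, which is why the slide asks for idempotent event processing.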
18
Other basic configuration
Be Safe, Not Sorry
o acks = all
o block.on.buffer.full = true
o Producer retries = MAX_INT
o ( max.in.flight.requests.per.connection = 1 )
o producer.close()
o Replication factor >= 3
o min.insync.replicas = 2
o unclean.leader.election.enable = false
o enable.auto.commit = false
o Commit after processing
o Monitor!
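The interplay of replication factor, min.insync.replicas and acks = all can be made concrete with a little arithmetic. The sketch below is an illustration with hypothetical names; it assumes acks = all on the producer and unclean leader election disabled, as recommended on this slide.

```java
// Hypothetical helper for reasoning about the settings; it does not talk to Kafka.
final class DurabilityCheck {

    /**
     * Number of replica brokers of a partition that may be unavailable while
     * producers using acks=all can still write to it.
     */
    static int tolerableFailuresForWrites(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException(
                    "min.insync.replicas must be between 1 and the replication factor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    /**
     * Number of replicas that can be lost after an acknowledged write without
     * losing that write, since it exists on at least minInsyncReplicas replicas.
     */
    static int replicaLossesWithoutDataLoss(int minInsyncReplicas) {
        return minInsyncReplicas - 1;
    }

    public static void main(String[] args) {
        System.out.println("writes survive broker failures: "
                + tolerableFailuresForWrites(3, 2));
        System.out.println("acknowledged data survives replica losses: "
                + replicaLossesWithoutDataLoss(2));
    }
}
```

For the recommended values (replication factor 3, min.insync.replicas 2) both numbers are 1: one broker can be down without blocking writes, and an acknowledged message survives the loss of one replica.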
19
Monitoring
http://www.spiegel.de/spiegel/print/d-129456859.html
20
Timeseries Metrics to Graphite
[Diagram: three KafkaBroker instances and a KafkaConsumer, each with Metrics Library + Graphite Reporter, send metrics to Graphite, which is visualized in Grafana]
21
Kafka-Manager: Open Source UI/API Kafka Mgmt Tool
https://github.com/yahoo/kafka-manager
• Good for current cluster status and ad-hoc analysis
• Provides a status API (HTTP)
• Consumers only displayed during active consumption
• 0.10.x support still not merged
22
Kafka-Manager API Example
curl -XGET http://kafka-manager/api/status/VP2/mdeAppGroup/groupSummary?consumerType=KF
{memberDataChanges:
  {totalLag: 142,
   percentageCovered: 100,
   partitionOffsets:
     [1779279,
      372957,
      368100,
      372415,
      368349,
      374649,
      373262,
      373934,
      1775065,
      373339,
      369416,
      374362,
      […]
23
Burrow: API-only Consumer Lag Checking
curl -XGET http://burrow/v2/kafka/vp2/consumer/mdeAppGroup/lag
{
  error: false,
  message: "consumer group status returned",
  status: {
    cluster: "vp2",
    group: "mdeAppGroup",
    status: "ERR",
    complete: false,
    partitions: [
      {
        topic: "memberDataChanges",
        partition: 1,
        status: "STOP",
        start: {
          offset: 1775109,
          timestamp: 1485253978439,
          lag: 0
        },
        end: {
          offset: 1775127,
          timestamp: 1485254054861,
          lag: 1
        }
      },
      […]
    ],
    totallag: 8
  },
  request: {
    url: "/v2/kafka/vp2/consumer/mdeAppGroup/lag",
    host: "hqiqlpxxap89",
    cluster: "vp2",
    group: "mdeAppGroup",
    topic: ""
  }
}
No thresholds required
Alerting via email and HTTP POST
Issue: Calculate lag at request time, not commit time
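Both monitoring tools report consumer lag, which in its usual definition is the distance between the end of a partition's log and the consumer group's committed offset. The sketch below shows that arithmetic in plain Java with hypothetical names. It does not reproduce how Kafka-Manager or Burrow compute their values internally, and treating a partition without a committed offset as starting at 0 is an assumption made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the offsets would come from a monitoring API or the Kafka client.
final class ConsumerLag {

    /** Lag of one partition: events written but not yet committed by the group. */
    static long partitionLag(long logEndOffset, long committedOffset) {
        // Clamp at zero in case the two values were sampled at different times.
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Sum of the lag over all partitions of a topic. */
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0L;
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            // Assumption: no committed offset means nothing was consumed yet.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += partitionLag(entry.getValue(), committed);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up offsets for two partitions, not taken from the slides.
        Map<Integer, Long> logEnd = new HashMap<>();
        logEnd.put(0, 1000L);
        logEnd.put(1, 500L);
        Map<Integer, Long> committed = new HashMap<>();
        committed.put(0, 990L);
        committed.put(1, 500L);
        System.out.println("total lag: " + totalLag(logEnd, committed));
    }
}
```

Because the log end offset keeps moving while producers write, the moment at which the two offsets are sampled affects the reported lag.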
24
"God gives the nuts, but he does not crack them."
Franz Kafka
PAYBACK GmbH
Maxim Schelest
Thomas Falkenberg
Theresienhöhe 12
80339 München
Phone +49 (0) 89 997 41 – 0
PAYBACK.net | PAYBACK.de
