Learnings From Shipping 1000+ Streaming Data Pipelines To Production with Hakan Lofcali & Stefan Sprenger

Learnings From Shipping
1000+ Streaming Data Pipelines
To Production
Hakan Lofcali, Stefan Sprenger
{hakan,stefan}@datacater.io

‣We develop tools for developers working with streaming data
‣With Kafka, Kubernetes, and fewer than five developers, we built a platform that has helped teams
deploy more than 1,000 streaming data pipelines to production
‣We will walk through our journey: the tools we adopted, the hurdles we encountered, and the
solutions we found
‣Infra Space
‣Customer Solutions Space
2
WHAT WE DO / WHO WE ARE

3
STREAMING DATA PIPELINES
Continuous Applications of Data Transformations
[Diagram: Kafka Connect source connector → Kafka topic → Kafka Streams → Kafka topic → Kafka Connect sink connector]

4
STREAMING PIPELINES IN THE WILD
Customer communications in real-time
Actionable
data
Clickstream
data
Outbox
service
Raw data
Process
clickstream events

5
GOALS FOR THIS TALK
Avoid common pitfalls in streaming ETL
‣How to operate streaming data pipelines in an efficient and robust manner?
‣How to deal with resource-leaking Kafka Connect connectors?
‣How to monitor and debug running pipelines?
‣What are ways to deal with large data sources or slow data sinks?
‣What is missing in today’s ecosystem for streaming to become a commodity?

How to operate streaming data pipelines in
an efficient and robust manner?
6

7
SEPARATE BY NODE POOL
Kafka NodePool
Apache Kafka
Broker*
Strimzi Kafka
Operator
Control Plane Nodepool
Pod
We operate one K8s Cluster - Multiple Node pools
K8s StatefulSet
K8s Deployment
DataCater
Control Plane
K8s Deployment
…
Kafka Connect Nodepool
Quarkus
Pipeline
Pod
K8s Deployment(s)
▸ Max 110 pods per node
▸ Max 5,000 nodes per cluster
▸ Max 150,000 pods in total
▸ *Separate Kafka and Kafka Connect clusters
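The limits above interact: 110 pods per node times 5,000 nodes would give 550,000 pods, so the cluster-wide cap of 150,000 pods is reached first. A minimal sketch of that arithmetic follows. The helper name is hypothetical, and it assumes every node can actually be filled up to its pod limit, which ignores CPU and memory constraints.

```python
# Limits quoted on the slide.
MAX_PODS_PER_NODE = 110
MAX_NODES_PER_CLUSTER = 5_000
MAX_PODS_PER_CLUSTER = 150_000


def max_schedulable_pods(nodes: int) -> int:
    """Upper bound on pods for a cluster with `nodes` nodes (hypothetical helper)."""
    if not 0 <= nodes <= MAX_NODES_PER_CLUSTER:
        raise ValueError("node count outside the supported range")
    # Whichever limit is hit first decides the bound.
    return min(nodes * MAX_PODS_PER_NODE, MAX_PODS_PER_CLUSTER)


if __name__ == "__main__":
    for n in (10, 1_000, 5_000):
        print(n, max_schedulable_pods(n))
```

With one pod per pipeline, the per-node limit dominates for small node pools, while the total pod cap dominates once the pool grows beyond roughly 1,360 nodes.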

State-of-the-art Orchestration
8
PROCESS ORCHESTRATION
▸ We started out on a single VM and moved to a distributed
process orchestration tool
▸ Kafka’s ecosystem lags behind state-of-the-art process
orchestration tools such as Kubernetes and Nomad
▸ ksqlDB and Kafka Connect manage processes themselves, but,
as we will see, they lack fundamental patterns needed to
operate them at scale
Kafka Streams
Single VM Docker
Java Quarkus
Kubernetes

Quarkus SmallRye Reactive Messaging
9
STARTUP TIME
Scheduled
0s
Scheduled
60s
First Event
Processed
First Event
Processed
5s …
Liveness
OK
10s
Kafka Streams
30s

10
WORKLOAD DENSITY
Docker on Single VM
Quarkus
Quarkus
Quarkus
Kubernetes
Kafka Streams
Kafka Streams
RAM < 1.5GB
RAM < 1.5GB
RAM < 300MB
RAM < 300MB
RAM < 300MB

11
STRIMZI FOR KAFKA
…
Kubernetes; Dedicated node pool for Kafka
Apache Kafka
Broker
Apache Kafka
Broker
Apache Kafka
Broker
Strimzi Kafka
Operator
Kubernetes
StatefulSet
Pod Pod Pod

12
CONSUMER RE-BALANCING
summit
consumer 0
summit
consumer 1
summit
consumer 2
…
Pod Pod Pod
summit
partition 0
summit
partition 1
summit
partition 2
… … …
…
… …

13
CONSUMER RE-BALANCING
summit
consumer 0
summit
consumer 1
summit
consumer 2
…
Pod Pod Pod
summit
partition 0
summit
partition 1
summit
partition 2
… … …
…
… …
▸ During a consumer rebalance,
consumption stops until the
group coordinator has
completed the rebalance
▸ The number of consumers can
change because of errors,
disconnections, or
new load requirements

14
UNEXPECTED SHUTDOWN
Startup Time
Partition Size

15
UNEXPECTED SHUTDOWN
Startup Time
Partition Size
Not Linear
Point of No Recovery
Log Size

How to deal with resource-leaking Kafka
Connect connectors?
16

17
KAFKA CONNECT SELF-MANAGED
…
Kubernetes (K8s); Dedicated Kafka Connect Nodepool
ElasticSearch Sink
Task C
S3 Source
Task A
PostgreSQL Source
Task A
K8s Deployment / Connect Cluster
Pod Pod Pod
PostgreSQL Source
Task B
MySQL CDC Source
Task A
MySQL CDC Source
Task B

18
…
ElasticSearch Sink
Task C
MySQL CDC Source
Task A
MySQL CDC Source
Task B
S3 SOURCE
TASK A
Pod Pod
PostgreSQL Source
Task A
PostgreSQL Source
Task B

19
…
ElasticSearch Sink
Task C
S3 SOURCE
TASK A
Pod Pod
MySQL CDC Source
Task A
MySQL CDC Source
Task B
PostgreSQL Source
Task A
PostgreSQL Source
Task B

20
…
ELASTICSEARCH SINK
TASK C
S3 SOURCE
TASK A
Pod Pod
MySQL CDC Source
Task A
MySQL CDC Source
Task B
PostgreSQL Source
Task A
PostgreSQL Source
Task B

Connect Cluster Connect Cluster
21
…
ElasticSearch Sink
Task C
S3 Source
Task A
Connect Cluster
Pod Pod Pod
MySQL CDC Source
Task A
MySQL CDC Source
Task B
PostgreSQL Source
Task A
PostgreSQL Source
Task B

22
…
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
S3 Source
Task A
Pod
MySQL CDC Source
Task A

23
…
S3 SOURCE
TASK A
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
MySQL CDC Source
Task A

24
…
S3 SOURCE
TASK A
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
MySQL CDC Source
Task A

25
…
ElasticSearch Sink
Task A
PostgreSQL Source
Task A
Connect Cluster
Pod Pod
S3 Source
Task A
Pod
MySQL CDC Source
Task A

▸ Use state-of-the-art orchestration tools.
▸ Running Kafka on Kubernetes does not automatically provide elasticity.
▸ Kafka Connect is not self-contained. This becomes a bigger headache the more
connector tasks run in a given cluster.
▸ Think about startup time throughout your tech stack, from Kafka brokers through Connect
tasks to streaming applications.
26
TAKE-AWAYS
Key Learnings

How to monitor and debug pipelines?
27

28
MONITORING STREAMING DATA PIPELINES
Kafka Topic
Kafka
Connect
Source
Connector
Kafka
Connect
Sink
Connector

▸ External data sources or data sinks are unavailable (temporarily)
▸ Consumers (processors or sink connectors) are slower than producers
▸ Processing of events fails
29
POTENTIAL PRODUCTION ISSUES
Most common issues in streaming data pipelines

30

31
MONITORING CONNECTORS
Kafka Topic
Kafka
Connect
Source
Connector
Kafka
Connect
Sink
Connector
Monitoring the health of connectors
‣Periodically call /connectors/:connector_name/status and investigate the response

32
GET /connectors/hdfs-sink/status
{
"name": "hdfs-sink",
"connector": {
"state": "RUNNING",
"worker_id": "localhost:8083"
},
"tasks":
[
{
"id": 0,
"state": "RUNNING",
"worker_id": "localhost:8083"
}
]
}
Healthy

33
GET /connectors/hdfs-sink/status
{
"name": "hdfs-sink",
"connector": {
"state": "FAILED",
"worker_id": "localhost:8083"
},
"tasks":
[
{
"id": 0,
"state": "FAILED",
"worker_id": "localhost:8083",
"trace": "org.apache.kafka.common.errors.RecordTooLargeException"
}
]
}
Unhealthy

34
Kafka Topic
Kafka
Connect
Source
Connector
Kafka
Connect
Sink
Connector
‣Periodically call /connectors/:connector_name/status and investigate the response
‣If the connector has failed, try restarting it (this handles, e.g., temporary API outages) and
escalate or alert after X restarts
‣Sometimes, escalating directly is the more reasonable choice
Monitoring the health of connectors
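The polling loop described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the worker address, the function names, and the default of three restarts are assumptions. The endpoints used are the Kafka Connect REST API's `GET /connectors/{name}/status`, `POST /connectors/{name}/restart`, and `POST /connectors/{name}/tasks/{id}/restart`.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumption: address of a Connect worker


def failed_task_ids(status: dict) -> list:
    """Return the ids of all tasks whose state is FAILED."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def next_action(status: dict, restarts_so_far: int, max_restarts: int = 3) -> str:
    """Decide what to do with a status response: 'ok', 'restart', or 'escalate'."""
    connector_failed = status.get("connector", {}).get("state") == "FAILED"
    if not connector_failed and not failed_task_ids(status):
        return "ok"
    if restarts_so_far >= max_restarts:
        return "escalate"  # the "X restarts" threshold has been reached
    return "restart"


def fetch_status(name: str) -> dict:
    """Call GET /connectors/{name}/status and parse the JSON response."""
    url = f"{CONNECT_URL}/connectors/{name}/status"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def _post(path: str) -> None:
    """Send an empty POST request to the Connect worker."""
    req = urllib.request.Request(f"{CONNECT_URL}{path}", data=b"", method="POST")
    urllib.request.urlopen(req, timeout=10).close()


def restart_failed(name: str, status: dict) -> None:
    """Restart the connector and/or its failed tasks via the REST API."""
    if status.get("connector", {}).get("state") == "FAILED":
        _post(f"/connectors/{name}/restart")
    for task_id in failed_task_ids(status):
        _post(f"/connectors/{name}/tasks/{task_id}/restart")
```

A scheduler would call `fetch_status` periodically, pass the result to `next_action` together with a per-connector restart counter, and either call `restart_failed` or raise an alert. Keeping the decision in a pure function makes the policy testable without a running Connect cluster.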

35

36
MONITORING BACKPRESSURE
Consumer Lags
Kafka Topic Consumer
‣Difference between the latest offset available in the Kafka topic (partition) and the
latest offset processed by the consumer
‣Indicates how far consumers are behind producers, measured in number of
records
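The definition above amounts to a per-partition subtraction. A minimal sketch, assuming the end offsets and committed offsets have already been fetched (for example with a Kafka admin client) and are passed in as plain dictionaries keyed by partition number. The function name is hypothetical, and treating a partition without a committed offset as fully unprocessed is a simplifying assumption.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Return the lag per partition: latest available offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: no committed offset means nothing has been processed yet.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


if __name__ == "__main__":
    end = {0: 120, 1: 80, 2: 50}
    committed = {0: 100, 1: 80}
    per_partition = consumer_lag(end, committed)
    print(per_partition, sum(per_partition.values()))
```

Summing the per-partition values gives the total lag of a consumer group on a topic, which is usually the number worth alerting on.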

37
Kafka Topic
Kafka
Connect
Source
Connector
Kafka
Connect
Sink
Connector
Consumer Lags in Streaming Data Pipelines

38
Kafka Topic
Kafka
Connect
Source
Connector
Kafka
Connect
Sink
Connector
Kafka Streams Consumer Lag
‣Number of records that have been extracted by the data source connector but have not
yet been processed by the Kafka Streams app
‣If data processing is slower than extraction, you might want to increase the degree of
parallelism of the Kafka Streams app

39
Kafka Topic
Kafka
Connect
Source
Connector
Kafka
Connect
Sink
Connector
Sink Connector Consumer Lag
‣Number of records that have been processed by the Kafka Streams app but have not yet
been published by the sink connector
‣If publishing data to the data sinks is slower than processing, you might want to increase
the number of tasks of the sink connector

40

41
DEAD-LETTER QUEUES
Keep track of errors in processing
‣By default, Kafka Connect connectors fail
when they encounter errors in processing
‣We recommend configuring a dead-letter
queue (topic) for storing records that could
not be processed
‣Monitor the dead-letter queue topic and
manually investigate failed records
errors.tolerance = all
errors.deadletterqueue.topic.name = topic-dlq
Topic
Dead-letter
queue topic
Successful
processing
Failed
processing
Kafka
Connect
Source
Connector
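The two properties shown above can be merged into an existing connector configuration before it is submitted to Kafka Connect. A minimal sketch: the helper name and the example connector settings are hypothetical. In Apache Kafka Connect the `errors.deadletterqueue.*` properties take effect for sink connectors, and `errors.deadletterqueue.context.headers.enable` adds the failure reason to the record headers, which helps with the manual investigation.

```python
def with_dead_letter_queue(config: dict, dlq_topic: str) -> dict:
    """Return a copy of a sink connector config with dead-letter queue handling enabled."""
    updated = dict(config)
    updated.update({
        # Keep the task running when a record cannot be processed.
        "errors.tolerance": "all",
        # Route records that could not be processed to this topic.
        "errors.deadletterqueue.topic.name": dlq_topic,
        # Attach error context (exception, original topic, etc.) as record headers.
        "errors.deadletterqueue.context.headers.enable": "true",
    })
    return updated


if __name__ == "__main__":
    # Hypothetical base configuration for illustration only.
    base = {"topics": "orders", "tasks.max": "2"}
    print(with_dead_letter_queue(base, "topic-dlq"))
```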

What are ways to deal with large data
sources or slow data sinks?
42

43
DEALING WITH LARGE DATA SOURCES
‣Large data sources hurt most during initial snapshots,
which can take hours
‣Use multiple connectors for the same database
and make use of table.include.list
‣Adjust the snapshot query and consider only a
subset of the data source
‣Mitigate pain with incremental snapshotting
‣Accelerate snapshotting with parallelisation
PostgreSQL
Debezium
Source
Connector
TBs of data
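Splitting one database across several connectors with `table.include.list` can be sketched as below. This is an illustration under assumptions: the helper names, connector names, and slot names are hypothetical, and the round-robin split ignores table sizes. For the Debezium PostgreSQL connector, each connector needs its own replication slot (`slot.name`); other properties that must be unique per connector are omitted here.

```python
def split_tables(tables: list, connectors: int) -> list:
    """Distribute tables round-robin over at most `connectors` groups."""
    if connectors < 1:
        raise ValueError("need at least one connector")
    groups = [[] for _ in range(connectors)]
    for i, table in enumerate(tables):
        groups[i % connectors].append(table)
    return [g for g in groups if g]  # drop empty groups


def connector_configs(base: dict, tables: list, connectors: int) -> dict:
    """Build one config per connector, each capturing a subset of the tables."""
    configs = {}
    for i, group in enumerate(split_tables(tables, connectors)):
        cfg = dict(base)
        cfg["table.include.list"] = ",".join(group)
        cfg["slot.name"] = f"snapshot_slot_{i}"  # hypothetical slot naming scheme
        configs[f"pg-source-{i}"] = cfg  # hypothetical connector name
    return configs


if __name__ == "__main__":
    base = {"connector.class": "io.debezium.connector.postgresql.PostgresConnector"}
    tables = ["public.orders", "public.customers", "public.events"]
    for name, cfg in connector_configs(base, tables, 2).items():
        print(name, cfg["table.include.list"])
```

A size-aware split (largest tables on their own connector) would usually balance snapshot time better than round-robin, but requires table statistics that this sketch does not assume.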

44
DEALING WITH SLOW DATA SINKS
Kafka
Connect
Sink
Connector
Elasticsearch
‣Detect slow data sinks by monitoring the sink
connector consumer lag
‣Parallelise sending records to the data sink by
increasing the number of connector tasks
‣If available, batch multiple records and send
them with one request to the data sink
‣Avoid duplicated data delivery by adjusting
max.poll.records or max.poll.interval.ms
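Batching, as recommended above, means grouping records so that one request to the sink carries many of them. A minimal, sink-agnostic sketch: the function name is hypothetical, and the actual request to the data sink (for example a bulk API call) is left out because it depends on the sink.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` records, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller batch.
        yield batch


if __name__ == "__main__":
    for batch in in_batches(range(5), 2):
        print(batch)  # each batch would be sent with a single request
```

Larger batches reduce the number of round trips but increase the time spent per poll, which is why the batch size should be chosen together with `max.poll.records` and `max.poll.interval.ms`.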

What is missing in today’s ecosystem for
streaming to become a commodity?
45

46
SERVERLESS TOPICS
‣Partitioned topics are the de facto standard for
persisting events
‣# partitions = maximum degree of parallelism
‣Choosing the number of partitions remains a crucial
question with significant impact on future cost and
performance, and it needs to be answered at topic
creation time (!)
‣The ability to choose the degree of parallelism
dynamically would make it easier to cope with peak loads
"Horizontal Partition Autoscaler"
Partition 0
1 partition
Partition 0 Partition 1 Partition 2
3 partitions
Partition 0
1 partition
Scale Up
Scale Down
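Why the partition count caps parallelism can be shown with a simplified assignment: within a consumer group, each partition is read by one consumer, so consumers beyond the partition count stay idle. The sketch below uses a plain round-robin split for illustration. It is not Kafka's actual assignor, and the function names are hypothetical.

```python
def assign(partitions: int, consumers: int) -> dict:
    """Assign partition ids to consumer ids round-robin (simplified model)."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    assignment = {c: [] for c in range(consumers)}
    for p in range(partitions):
        assignment[p % consumers].append(p)
    return assignment


def active_consumers(assignment: dict) -> int:
    """Count the consumers that received at least one partition."""
    return sum(1 for parts in assignment.values() if parts)


if __name__ == "__main__":
    # Three partitions, five consumers: two consumers have nothing to read.
    result = assign(3, 5)
    print(result, active_consumers(result))
```

Adding consumers beyond the number of partitions therefore does not help with peak loads, which is the gap a partition autoscaler would address.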

47
EASE OPERATIONS
More and better managed services
‣Operating streaming data pipelines boils down to running multiple distributed
systems and remains one of the big hurdles for its adoption
‣Managed services can reduce the operational pain
‣We witness the rise of cloud/SaaS offerings but believe there is still lots of room for
improvement

49
TAKE-AWAYS
Summary
‣Throwing Kafka and Kafka Connect at Kubernetes is beneficial but does not provide
a true cloud-native experience. It takes a few steps to, for instance, apply the
self-containment principle to Kafka Connect.
‣If possible, handle errors of connectors or streaming applications in an
automated manner without bringing the pipeline down
‣Many issues occur when integrating external systems that you do not control, e.g.,
snapshotting a very large database table or sending events to slow APIs.
