This document provides an overview of a company's first Kafka Streams project to build a streaming data pipeline. Some key lessons learned include adopting a data-first mindset where the data defines the application behavior and architecture. All business logic is modeled as data transformations. Testing was done using TopologyTestDriver for unit tests and emulators for external systems. Kafka Streams was determined to be a good fit as it provided an ordered, fault-tolerant processing pipeline with exactly-once guarantees. Future work includes open sourcing components and improving the declarative side effect handling in the KStreams DSL.
From Zero to Streaming Healthcare with Kafka Streams
1. From Zero to Streaming Healthcare
Alex Kouznetsov
Invitae
2. Overview
● Our first Kafka Streams project
● Techniques, challenges and lessons learned
● An ambitious number of slides — fast, high-level overview, less technical detail
4. The project
● Sendout to a partner lab
● Us: Python, Django REST, RMQ
● Them: poll-only HTTP API
○ create order for a sample → ok
○ query sample status → status
○ query for test results (by time range) → list of order IDs
○ get report for an order ID → report payload
7. Data-first mindset
● Code is transient, data lives on
● … so don’t hide it.
● Data defines behavior (especially in streaming apps)
● Data defines architecture
● Abolition of data ownership
● Every executable unit of business logic is a transformation of state
● Your business is a function
● Doesn’t that look like FP?
○ Applications are pure functions
○ Applications form a declarative pipeline
○ Inputs and outputs are strongly typed
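A sketch of that FP framing (illustrative types, not the project's actual model): each stage is a pure, strongly typed function, and the application is plain composition.

```scala
// Illustrative domain types — hypothetical, not from the talk's codebase.
final case class RawStatus(orderId: String, raw: String)
final case class ParsedStatus(orderId: String, phase: String)
final case class Notification(orderId: String, message: String)

// Each executable unit of business logic is a pure transformation.
val parse: RawStatus => ParsedStatus =
  r => ParsedStatus(r.orderId, r.raw.trim.toLowerCase)

val notify: ParsedStatus => Notification =
  p => Notification(p.orderId, s"Sample ${p.orderId} is now ${p.phase}")

// The application is a declarative pipeline: function composition,
// with strongly typed inputs and outputs at every stage.
val pipeline: RawStatus => Notification = parse andThen notify
```

Because each stage is pure, every stage (and the whole pipeline) is trivially unit-testable without any streaming infrastructure.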
8. Solving for data-first
● Breaking away from imperative mindset
● Model data to directly represent business logic (i.e. truth)
● Shifting logic from code to data by increasing granularity and precision
○ Higher precision → simpler transformations
○ Harder to refactor data than code
● How much design is enough? Until:
○ It makes sense on paper
○ Ephemeral state is eliminated
○ You don’t need logs any more
● Hypothesis: we can build the whole thing as a streaming app
12. Kafka != Messaging System
● Expect to see those messages again
● Everything must replay consistently
● Everything must replay safely
○ Trickle-down idempotency and boundaries
● Remember what you did — contribute to total state, then aggregate
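A dependency-free sketch of "contribute to total state, then aggregate" (types are hypothetical): each message folds into a keyed total state via a deterministic, duplicate-tolerant step, so replaying the log — even twice over — lands on the same state.

```scala
// Hypothetical message type: seq orders messages within one order ID.
final case class StatusMsg(orderId: String, seq: Int, status: String)

// Total state: the highest-seq message seen per order.
type Total = Map[String, StatusMsg]

// Contributing is deterministic and duplicate-tolerant: an old or repeated
// message is a no-op, so the same log always folds to the same state.
def contribute(total: Total, m: StatusMsg): Total =
  total.get(m.orderId) match {
    case Some(prev) if prev.seq >= m.seq => total
    case _                               => total.updated(m.orderId, m)
  }

def replay(log: Seq[StatusMsg]): Total =
  log.foldLeft(Map.empty[String, StatusMsg])(contribute)
```

This is the "trickle-down idempotency" property in miniature: because each step tolerates duplicates, the whole replay does too.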
13. Knowing when (and how) to crash
● Almost never :)
● Two crash reasons: bad circumstance, bad data
● Crashing on bad data is the same as not crashing, plus one manual step
● Don’t waste state: turn everything into data and write to a topic
● Every transformation should produce something (i.e., don’t .foreach())
● Data granularity helps layer strictness in smaller increments and minimize failure impact
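One way to picture "don't waste state — turn everything into data" (hypothetical types; `partitionMap` stands in for branching a stream): parse failures become values routed to a dead-letter output instead of exceptions that crash the instance.

```scala
// Hypothetical payload types.
final case class RawReport(orderId: String, payload: String)
final case class Report(orderId: String, result: String)
final case class BadRecord(orderId: String, reason: String)

// Bad data becomes a value, not an exception.
def parse(r: RawReport): Either[BadRecord, Report] =
  if (r.payload.nonEmpty) Right(Report(r.orderId, r.payload))
  else Left(BadRecord(r.orderId, "empty payload"))

// Every transformation produces something: good records flow downstream,
// bad ones are routed to a dead-letter output for the one manual step.
def route(batch: List[RawReport]): (List[BadRecord], List[Report]) =
  batch.partitionMap(parse)
```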
16. Topics and schemas: nice things
● Topic objects are values → “Find usages”
● Serde config is embedded in topic type
● StreamBuilder.from[K,V](Topic[_, _]): KStream[K,V] fixes K,V
● Having to think about (de)serializers: never!
● Topic creation, schema registration and compatibility checks: automated!
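A minimal, Kafka-free sketch of the topic-as-value idea — the `Serde` trait here is a dependency-free stand-in for Kafka's, and all names are illustrative:

```scala
// Stand-in for Kafka's Serde, to keep the sketch self-contained.
trait Serde[A] {
  def serialize(a: A): Array[Byte]
  def deserialize(bs: Array[Byte]): A
}

object Serde {
  implicit val stringSerde: Serde[String] = new Serde[String] {
    def serialize(a: String): Array[Byte]   = a.getBytes("UTF-8")
    def deserialize(bs: Array[Byte]): String = new String(bs, "UTF-8")
  }
}

// A topic is a value: its name and both serdes travel together, so K and V
// are fixed at the single place the topic is defined ("Find usages" works).
final case class Topic[K, V](name: String)(implicit
    val keySerde: Serde[K],
    val valueSerde: Serde[V]
)

// A builder method like StreamBuilder.from[K, V](topic) can recover both
// serdes from the topic value, so call sites never mention (de)serializers.
val sampleStatus: Topic[String, String] = Topic("sample-status")
```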
19. Solving scheduling
● Need to periodically rerun aggregates to produce new messages
● Problem: Streams client reacts only to new messages
● Solution: send a “trigger” message and .leftJoin() with KTable
● Needs a repartitioning trick
○ aggregate into a single-valued table, value is a set, or
○ make buckets, or
○ fan out trigger to match all keys (if known in advance)
● 👍 can source triggers from anywhere (we made a cron-like connector)
● avoid reacting to accumulated triggers
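Stripped of Kafka specifics, the trigger pattern can be sketched as follows (the table is a plain `Map`, and the fan-out variant and one-hour threshold are invented examples):

```scala
// A periodic trigger message; the cron-like connector would produce these.
final case class Trigger(at: Long)
// Hypothetical table value: an order still awaiting results.
final case class Pending(orderId: String, since: Long)

// Fan-out variant: one trigger is matched against every known key, so the
// aggregate re-fires even though no new data message arrived.
def onTrigger(table: Map[String, Pending], t: Trigger): List[(String, Pending)] =
  table.toList.collect {
    case (k, p) if t.at - p.since > 3600 => (k, p) // overdue: emit for re-query
  }
```

In the real topology the same shape appears as a trigger stream `.leftJoin()`-ed against the KTable, with one of the repartitioning tricks above aligning trigger keys with table keys.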
20. Solving IO
● Problem: need to call APIs
● Connectors?
● Problem: want to handle API calls as side effects in streaming context
● Solution: doing it in place works
○ For fast/idempotent effects
21. Solving ordering
● Sometimes request and response objects are dependent enough to need co-partitioning
● Total ordering ensures consistency of the log
● Case study: consecutive time interval queries
22. Transparency: topology diagrams
● Writing stream apps is fun at first, but then topology grows
● Problem: understanding how the large application works
● Solution: Topology.describe() all the things and glue
23. Transparency: metrics
● JMX/Prometheus metrics from CP Helm charts are OK for many cases
● For others, we make a streaming app :)
● Aggregate the metrics we need, then two options:
○ Spin up a Prometheus server thread, reading from state store
○ Push to push-gateway
24. Transparency: tracing
● Zipkin with customized Brave integration
● Write traces to a topic (using a Kafka Reporter from Zipkin API)
● Provide DSL support for emitting spans with tags and annotations
25. Scala + FP + cats = MEOW
● FP: interesting buy-in dynamics, safer more generic code in the long run
● Errors as values, natural conversion of effects/errors to streams
● Cats makes FP better, also excellent onboarding tool
● Type class pattern helps solve Avro and Serdes
● Managed errors, IO and Effects in tests
● Tagless Final → implement and test components with ease
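A small sketch of the tagless-final style mentioned above (all names are illustrative): logic is written against an algebra `F[_]`, so production can supply an IO-like effect while tests instantiate `F = Id` and assert directly.

```scala
// The algebra: the only capability the program knows about.
trait LabApi[F[_]] {
  def sampleStatus(orderId: String): F[String]
}

// Minimal Functor so the sketch needs no external library (cats provides this).
trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }

// Business logic stays polymorphic in the effect F.
def statusReport[F[_]](api: LabApi[F], orderId: String)(implicit F: Functor[F]): F[String] =
  F.map(api.sampleStatus(orderId))(s => s"$orderId: $s")

// In tests, F = Id: the effect disappears and assertions are direct.
type Id[A] = A

implicit val idFunctor: Functor[Id] = new Functor[Id] {
  def map[A, B](fa: Id[A])(f: A => B): Id[B] = f(fa)
}

val fakeApi: LabApi[Id] = new LabApi[Id] {
  def sampleStatus(orderId: String): Id[String] = "in-lab"
}
```

Swapping in an effectful interpreter for production changes nothing in `statusReport` — which is what makes components easy to implement and test independently.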
26. Testing
● TopologyTestDriver + multiple simultaneous topologies = integration-like unit tests
● Modular emulators for external systems (both embeddable and standalone)
● Time provider connected to TTD’s own time base
27. Topic planning
● Not all topics are created equal (primary vs derived, internal vs exposed)
● Consider long term retention, long lived schema
● Think about injection points
28. How did we do?
● Pretty well
● Launched on time with few surprises
○ Sudden offset loss, some non-idempotency leakage
○ IO outliving transactions, checkpoints, internal stores
○ Request/response dependency guesses were almost all correct
● System is fault-tolerant and self-recovering
● Framework for onboarding
● Were Streams the right choice?
○ Principled functional pipeline
○ EXACTLY_ONCE
○ Tracing, topology generation
○ Not all use cases, but many
29. Future TODOs (WIP)
● Open source more, externalize blog
● Improved topology derivation
● Better declarative side effects (KStreams DSL, topology)
● Formalize decoupled IO apps (micro sagas)