From Zero to Streaming Healthcare
Alex Kouznetsov
Invitae
Overview
● Our first Kafka Streams project
● Techniques, challenges and lessons learned
● An ambitious number of slides — a fast, high-level overview, less technical detail
Data Engineering
The project
● Sendout to a partner lab
● Us: Python, Django REST, RMQ
● Them: poll-only HTTP API
○ create order for a sample → ok
○ query sample status → status
○ query for test results (by time range) → list of order IDs
○ get report for an order ID → report payload
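A sketch of how that poll-only surface can be modeled in Scala. All names here (PartnerApi, OrderId, SampleId, Report, ApiError) are illustrative, not the lab's actual contract:

    import java.time.Instant

    final case class OrderId(value: String)
    final case class SampleId(value: String)
    final case class Report(orderId: OrderId, payload: String)
    final case class ApiError(message: String)

    // The four poll-only endpoints, with errors as values rather than exceptions.
    trait PartnerApi {
      def createOrder(sample: SampleId): Either[ApiError, OrderId]   // create order for a sample
      def sampleStatus(sample: SampleId): Either[ApiError, String]   // query sample status
      def completedOrders(from: Instant, to: Instant): Either[ApiError, List[OrderId]] // results by time range
      def report(order: OrderId): Either[ApiError, Report]           // report for an order ID
    }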
The plan
● Kafka, Schema Registry, KStreams, Scala
Lesson 1: Data is the Application
Data-first mindset
● Code is transient, data lives on
● … so don’t hide it.
● Data defines behavior (especially in streaming apps)
● Data defines architecture
● Abolition of data ownership
● Every executable unit of business logic is a transformation of state
● Your business is a function
● Doesn’t that look like FP?
○ Applications are pure functions
○ Applications form a declarative pipeline
○ Inputs and outputs are strongly typed
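To make the FP analogy concrete, a minimal sketch (all event names are illustrative) of an application as a pure, strongly typed function from input events to output events:

    sealed trait OrderEvent
    final case class OrderCreated(id: String)                 extends OrderEvent
    final case class ReportReceived(id: String, body: String) extends OrderEvent

    sealed trait Notification
    final case class ReportReady(id: String) extends Notification

    // Stripped of plumbing, the whole application is just this function:
    val notify: OrderEvent => List[Notification] = {
      case ReportReceived(id, _) => List(ReportReady(id))
      case _                     => Nil
    }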
Solving for data-first
● Breaking away from imperative mindset
● Model data to directly represent business logic (i.e. truth)
● Shifting logic from code to data by increasing granularity and precision
○ Higher precision → simpler transformations
○ Harder to refactor data than code
● How much design is enough? Until:
○ It makes sense on paper
○ Ephemeral state is eliminated
○ You don’t need logs any more
● Hypothesis: we can build the whole thing as a streaming app
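One way the "shift logic from code to data" step can look in Scala: replace a loose status string with a precise ADT, so invalid states are unrepresentable and downstream transformations get simpler (a sketch with illustrative names):

    sealed trait SendoutState
    case object Created                                 extends SendoutState
    final case class Submitted(partnerOrderId: String)  extends SendoutState
    final case class Reported(partnerOrderId: String,
                              reportPayload: String)    extends SendoutState
    final case class Failed(reason: String)             extends SendoutState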
Solving it on paper
The core
Lesson 2: Total State
Kafka != Messaging System
● Expect to see those messages again
● Everything must replay consistently
● Everything must replay safely
○ Trickle-down idempotency and boundaries
● Remember what you did — contribute to total state, then aggregate
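A sketch of "contribute to total state, then aggregate" in the Kafka Streams Scala DSL (topic names are hypothetical, and a real app would use Avro values rather than strings):

    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val builder = new StreamsBuilder
    // Every attempt is appended as data; nothing is overwritten.
    val attempts = builder.stream[String, String]("sendout-attempts")
    // The current view is always an aggregate over the full history.
    val state = attempts
      .groupByKey
      .aggregate("")((_, attempt, history) => s"$history|$attempt")
    state.toStream.to("sendout-state")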
Knowing when (and how) to crash
● Almost never :)
● Two crash reasons: bad circumstance, bad data
● Crashing on data is the same as not crashing, plus one manual step
● Don’t waste state: turn everything into data and write to a topic
● Every transformation should produce something (i.e., don’t .foreach())
● Data granularity helps layer strictness in smaller increments and minimizes failure impact
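A sketch of "bad data becomes data": parse failures are routed to an error topic instead of crashing the app, so every transformation produces something (topic names hypothetical):

    import scala.util.Try
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val builder = new StreamsBuilder
    val raw = builder.stream[String, String]("partner-responses")

    // Parse into Either: the error is a value, not an exception.
    val parsed = raw.mapValues(v => Try(v.toInt).toEither.left.map(_ => s"unparseable: $v"))

    parsed.flatMapValues(_.left.toOption.toList).to("parse-errors") // bad data, kept as data
    parsed.flatMapValues(_.toOption.toList).to("parsed-values")     // good data flows on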
Configuration as code
● Why not our application code?
Topics, schemas and SerDes
Topics and schemas: nice things
● Topic objects are values → “Find usages”
● Serde config is embedded in topic type
● StreamBuilder.from[K,V](Topic[_, _]): KStream[K,V] fixes K,V
● Having to think about (de)serializers: never!
● Topic creation, schema registration and compatibility checks: automated!
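A sketch of the topics-as-values idea (Topic and from are our illustrative names here, not stock Kafka Streams API): the serdes travel with the topic value, so K and V are fixed at every use site:

    import org.apache.kafka.common.serialization.Serde
    import org.apache.kafka.streams.kstream.Consumed
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.kstream.KStream

    object TopicSyntax {
      // A topic is a value carrying its name and both serdes.
      final case class Topic[K, V](name: String)(implicit val keySerde: Serde[K],
                                                 val valueSerde: Serde[V])

      implicit class RichBuilder(private val builder: StreamsBuilder) extends AnyVal {
        // Consuming a Topic[K, V] fixes the stream's types; no serde thinking.
        def from[K, V](topic: Topic[K, V]): KStream[K, V] =
          builder.stream[K, V](topic.name)(Consumed.`with`(topic.keySerde, topic.valueSerde))
      }
    }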
Opsing topics
Opsing schemas
Solving scheduling
● Need to periodically rerun aggregates to produce new messages
● Problem: Streams client reacts only to new messages
● Solution: send a “trigger” message and .leftJoin() with KTable (sketched below)
● Needs a repartitioning trick
○ aggregate into a single-valued table, value is a set, or
○ make buckets, or
○ fan out trigger to match all keys (if known in advance)
● 👍 can source triggers from anywhere (we made a cron-like connector)
● Avoid reacting to accumulated triggers
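A sketch of the trigger pattern with the single-key variant of the repartitioning trick (topic names hypothetical; real values would be Avro collections, not joined strings):

    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val builder = new StreamsBuilder

    // Pending work aggregated under one constant key; the value is the "set".
    val pending = builder.stream[String, String]("pending-work")
      .groupBy((_, v) => "all")
      .aggregate("")((_, v, acc) => if (acc.isEmpty) v else s"$acc,$v")

    // Triggers arrive with the same constant key, e.g. from a cron-like connector.
    builder.stream[String, String]("ticks")
      .leftJoin(pending)((_, pendingSet) => pendingSet)
      .filter((_, v) => v != null) // nothing pending, nothing to schedule
      .to("scheduled-batches")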
Solving IO
● Problem: need to call APIs
● Connectors?
● Problem: want to handle API calls as side effects in streaming context
● Solution: doing it in place works
○ For fast/idempotent effects
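Doing IO in place, sketched: the call happens inside mapValues and the outcome (success or failure) is emitted as data, which only works for fast, idempotent effects. callApi is a stand-in for a real client:

    import scala.util.Try
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    def callApi(payload: String): String = s"ok: $payload" // hypothetical, fast + idempotent

    val builder = new StreamsBuilder
    builder.stream[String, String]("api-requests")
      .mapValues(p => Try(callApi(p)).fold(e => s"error: ${e.getMessage}", identity))
      .to("api-responses") // the outcome is recorded as data either way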
Solving ordering
● Sometimes request and response objects are dependent enough to need co-partitioning
● Total ordering ensures consistency of the log
● Case study: consecutive time interval queries
Transparency: topology diagrams
● Writing stream apps is fun at first, but then the topology grows
● Problem: understanding how the large application works
● Solution: Topology.describe() all the things and glue
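Topology.describe() is stock Kafka Streams API; the "glue" is rendering its output into diagrams. A minimal sketch:

    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val builder = new StreamsBuilder
    builder.stream[String, String]("orders").to("orders-out")
    // Prints every source, processor, store and sink, ready for graphing.
    println(builder.build().describe())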
Transparency: metrics
● JMX/Prometheus metrics from CP Helm charts are OK for many cases
● For others, we make a streaming app :)
● Aggregate the metrics we need, then two options:
○ Spin up a Prometheus server thread, reading from state store
○ Push to push-gateway
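A sketch of the push-gateway option, assuming the standard Prometheus Java simpleclient; in the real app the gauge value comes from the aggregated state store:

    import io.prometheus.client.{CollectorRegistry, Gauge}
    import io.prometheus.client.exporter.PushGateway

    val registry = new CollectorRegistry
    val pendingGauge = Gauge.build("sendout_pending", "Pending sendouts").register(registry)
    pendingGauge.set(42) // would be read from the streams state store
    new PushGateway("pushgateway:9091").pushAdd(registry, "sendout-metrics")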
Transparency: tracing
● Zipkin with customized Brave integration
● Write traces to a topic (using a Kafka Reporter from Zipkin API)
● Provide DSL support for emitting spans with tags and annotations
Scala + FP + cats = MEOW
● FP: interesting buy-in dynamics, safer and more generic code in the long run
● Errors as values, natural conversion of effects/errors to streams
● Cats makes FP better, also excellent onboarding tool
● Type class pattern helps solve Avro and Serdes
● Managed errors, IO and Effects in tests
● Tagless Final → implement and test components with ease
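How the type class pattern can solve serdes, sketched (AvroCodec is our illustrative type class, not a library API): one instance per data type, and a Kafka Serde is derived for free:

    import org.apache.kafka.common.serialization.{Serde, Serdes => JSerdes}

    trait AvroCodec[A] {
      def encode(a: A): Array[Byte]
      def decode(bytes: Array[Byte]): A
    }

    object AvroCodec {
      // Any A with an AvroCodec instance gets a Serde summoned implicitly.
      implicit def serde[A](implicit c: AvroCodec[A]): Serde[A] =
        JSerdes.serdeFrom(
          (_: String, a: A) => c.encode(a),
          (_: String, bytes: Array[Byte]) => c.decode(bytes)
        )
    }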
Testing
● TopologyTestDriver + multiple simultaneous topologies = integration-like unit tests
● Modular emulators for external systems (both embeddable and standalone)
● Time provider connected to TTD’s own time base
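A minimal TopologyTestDriver sketch (createInputTopic/createOutputTopic are the Kafka 2.4+ test-utils API; topic names are illustrative):

    import java.util.Properties
    import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._
    import org.apache.kafka.streams.{StreamsConfig, TopologyTestDriver}

    val builder = new StreamsBuilder
    builder.stream[String, String]("orders").mapValues(_.toUpperCase).to("orders-upper")

    val props = new Properties
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "test")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234")

    val driver = new TopologyTestDriver(builder.build(), props)
    val in  = driver.createInputTopic("orders", new StringSerializer, new StringSerializer)
    val out = driver.createOutputTopic("orders-upper", new StringDeserializer, new StringDeserializer)

    in.pipeInput("k1", "report")
    assert(out.readValue() == "REPORT")
    driver.close()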
Topic planning
● Not all topics are created equal (primary vs derived, internal vs exposed)
● Consider long-term retention and long-lived schemas
● Think about injection points
How did we do?
● Pretty well
● Launched on time with few surprises
○ Sudden offset loss, some non-idempotency leakage
○ IO outliving transactions, checkpoints, internal stores
○ Request/response dependency guesses were almost all correct
● System is fault-tolerant and self-recovering
● Framework for onboarding
● Were Streams the right choice?
○ Principled functional pipeline
○ EXACTLY_ONCE
○ Tracing, topology generation
○ Not all use cases, but many
Future TODOs (WIP)
● Open source more, externalize blog
● Improved topology derivation
● Better declarative side effects (KStreams DSL, topology)
● Formalize decoupled IO apps (micro sagas)
https://github.com/invitae/
unthingable@GH
alexk@invitae.com
Q & A

From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invitae) Kafka Summit SF 2019
