With increasing data volumes typically comes a corresponding increase in (non-windowed) batch processing times, and many companies have looked to streaming as a way of delivering data processing results faster and more reliably. Event Driven Architectures further enhance these offerings by breaking centralised Data Platforms into loosely coupled and distributed solutions, supported by linearly scalable technologies such as Apache Kafka™ and Kafka Streams.
However, there remains a problem of how to handle changes to operational systems: if a record is the result of business logic, and that business logic changes, what do we do? Do we recalculate everything on the fly, adding in additional latencies for all data requests and potentially breaching non-functional requirements? Or do we run a batch job, risking that incorrect data will be served whilst the job is running?
This talk covers how 6point6 leveraged Kafka and Kafka Streams to transition a customer from a traditional business flow onto an Event Driven Architecture, with business logic triggered directly by real-time events across over 3,000 loosely coupled business services, whilst ensuring that the active development of these services (and the logic and models they contain) would not affect components which relied on data served by the platform.
Learn how we:
– Used versioning of topics, data and business logic to facilitate iterative development, ensuring that the reprocessing of large volumes of data would not result in incorrect or stale data being delivered.
– Handled distributed versioning of JSON event messages between separate teams/services, using discovery, automated contract negotiation and version porting.
– Developed technical patterns and libraries to allow rapid development and deployment of new event driven services using Kafka Streams.
– Developed functionality and approaches for deploying defensive services, including strategies for event retry and failure.
2. The Problem
Changes such as business logic modifications, added enrichments, or bug fixes could
invalidate cached and persisted data.
Consumers also need to be protected from changes made by upstream producers.
How do we:
• Support active development and iterative releases
• Deal with failure pragmatically
• Only serve current and correct data
• Avoid breaching non-functional requirements
With agile approaches we (typically) don’t do big-bang releases but develop business
functionality through frequent incremental releases.
3. Classic approaches don’t meet these needs
Calculate on-demand (pull model): business logic is applied on request.
• No risk of staleness, as data is always fresh
• Cannot use pure streaming, as it needs random access to data
• May breach NFRs if the calculation takes too long or dependent services are slow
Bulk recalculation (λ, push model): the result of the business logic is stored in a persisted store, with the calculation running as a large job.
• Can be used in conjunction with streaming
• Depending on the scenario, tends to be more efficient for large reprocessing
• Risk of staleness or missing data, as a request may happen whilst results are being updated
Stream replay (κ, push model): business logic is applied to single messages or small windows and then sent to another stream or stored/cached.
• Streaming!
• Risk of staleness or missing data
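For concreteness, a minimal Kafka Streams sketch of the push model used by the stream-replay approach above: business logic runs per message and the result is written to a downstream topic. The topic names and the applyBusinessLogic step are illustrative placeholders, not taken from the talk.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class PushModelSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "business-logic-v1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               // Business logic is applied to each message (or small window)...
               .mapValues(PushModelSketch::applyBusinessLogic)
               // ...and the result is pushed to a downstream topic / store.
               .to("events-enriched", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    // Stand-in for the real business logic.
    static String applyBusinessLogic(String payload) {
        return payload.toUpperCase();
    }
}
```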
5. Our approach
• Everything is versioned:
• Importantly, the persisted data is as well
• REST service requests appropriately
versioned persisted data
• Persisted data updated after business
logic change through bulk processing:
• Not really a lambda architecture → only run
bulk processing during uplift
• REST service could request data waiting to be recalculated:
• Have a mechanism to ‘fast track’ calculations of requested stale data (see the sketch below)
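A hedged sketch of the versioning idea on this slide, assuming persisted results carry the business-logic version they were computed with and anything older than the current version is treated as stale. The ResultStore and RecalcRequester interfaces, and CURRENT_LOGIC_VERSION, are illustrative names, not from the talk.

```java
public class VersionedReadSketch {

    static final int CURRENT_LOGIC_VERSION = 7;   // assumption: bumped on each business-logic change

    record PersistedResult(String key, String payload, int logicVersion) {}

    interface ResultStore {
        PersistedResult find(String key);          // e.g. backed by the persisted store
    }

    interface RecalcRequester {
        void fastTrack(String key);                // e.g. puts a message onto a recalc topic
    }

    // Returns fresh data, or triggers a fast-tracked recalculation if the stored result
    // was produced by an older version of the business logic.
    static PersistedResult fetch(String key, ResultStore store, RecalcRequester recalc) {
        PersistedResult result = store.find(key);
        if (result == null || result.logicVersion() < CURRENT_LOGIC_VERSION) {
            recalc.fastTrack(key);                 // stale: ask for it to be recalculated first
            return null;                           // caller decides whether to block, poll or retry
        }
        return result;                             // already computed with the current logic
    }
}
```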
6. Fully streaming
• Single code-base supports both BAU
and stale-data requests
• Need a separate topic and job for fast
tracked recalculation requests to allow
them to be prioritised:
• Cannot currently configure Consumers &
Streams to prioritise a topic (KIP-349)
• Simpler streaming jobs but increased
complexity in the REST services:
• REST services have to poll the database until the version is updated
• REST services have to know how to put messages onto the recalc topic (see the sketch below)
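A minimal sketch of that fully streaming read path, assuming the REST service publishes a fast-track request to a dedicated recalc topic and then polls its database until the stored logic version catches up. The topic name, timeout, and ResultStore interface are assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FastTrackRecalcSketch {

    static final int CURRENT_LOGIC_VERSION = 7;

    interface ResultStore {
        Integer logicVersionFor(String key);   // version of the persisted result, or null if absent
        String payloadFor(String key);
    }

    static String fetchWithFastTrack(String key, ResultStore store,
                                     KafkaProducer<String, String> producer) throws InterruptedException {
        Integer stored = store.logicVersionFor(key);
        if (stored == null || stored < CURRENT_LOGIC_VERSION) {
            // Dedicated topic so fast-tracked work can be prioritised over the bulk uplift
            // (Kafka cannot currently prioritise one topic over another - see KIP-349).
            producer.send(new ProducerRecord<>("recalc-requests", key, key));

            // Poll until the streaming job has written a result at the current logic version.
            Instant deadline = Instant.now().plus(Duration.ofSeconds(10));
            while (Instant.now().isBefore(deadline)) {
                Integer now = store.logicVersionFor(key);
                if (now != null && now >= CURRENT_LOGIC_VERSION) {
                    return store.payloadFor(key);
                }
                Thread.sleep(100);
            }
            throw new IllegalStateException("Timed out waiting for recalculation of " + key);
        }
        return store.payloadFor(key);
    }

    static KafkaProducer<String, String> producer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        return new KafkaProducer<>(props);
    }
}
```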
7. Encapsulated REST services
• Streaming jobs are simple co-ordinators:
• REST services contain the business logic
and do all data processing
• REST services are simpler and self-contained for stale data requests:
• Don’t need to put messages back onto a
topic
• Requests simply block until the data is
updated
• Topic management is much simpler…
• But the complexity has moved into the
streaming jobs:
• Have to handle failure states for REST calls such as overload, timeouts, transactions, etc.
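A minimal sketch of the coordinator pattern on this slide, assuming the Kafka Streams job holds no business logic and simply delegates each message to a REST call, writing the response downstream. The service URL and topic names are placeholders; real code would add the retry/pause/park/fail handling described later.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class CoordinatorJobSketch {

    static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "coordinator-v1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               // The streaming job only co-ordinates: business logic lives in the REST service.
               .mapValues(CoordinatorJobSketch::callBusinessService)
               .to("events-processed", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    static String callBusinessService(String payload) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://business-service/process"))
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            // Real code would apply the retry/pause/park/fail strategies from later slides.
            throw new RuntimeException("Business service call failed", e);
        }
    }
}
```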
8. Bulk reprocessing
• λ-style: batch job
• Requires original messages to have been saved to non-streaming store
• If business logic is held within REST services, can simply call them
• If stream job, can copy data back to the original stream or a new one
• Could also have the bulk job run the business logic, but then you have duplication issues
• κ-style: stream replay
• Have to be careful of caches: single topic can’t efficiently serve both the head and random access
• May be better to have two topics: one for live data and one for replays
• Simpler deployment stack as it doesn’t require extra technology → simply reset the offset (see the sketch below)
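One possible shape of the κ-style replay with two topics, as suggested above: rewind to the start of the original events and copy them onto a dedicated replay topic, so the live topic keeps serving the head while the replay is consumed separately. Topic names and the single-partition assignment are simplifications for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class StreamReplaySketch {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-copier");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            // "Simply reset the offset": assign partition 0 explicitly and rewind to the start
            // (a real replay would cover every partition of the source topic).
            TopicPartition source = new TopicPartition("events", 0);
            consumer.assign(List.of(source));
            consumer.seekToBeginning(List.of(source));

            // Copy historic messages onto the replay topic; a replay-specific streaming job
            // consumes "events-replay" with the new business logic.
            ConsumerRecords<String, String> records;
            do {
                records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("events-replay", record.key(), record.value()));
                }
            } while (!records.isEmpty());   // sketch only: stops at the first empty poll
        }
    }
}
```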
10. Defensive strategies
• Producers validate all messages against a JSON schema before sending
• Consumers accept a fixed range of schema versions and validate payloads:
• Ensure producers’ standards compliance by ignoring incorrectly headered messages
• Will ignore any messages which fall outside of configured schema/library range
• Messages which fail schema validation are sent to a DLQ
• Schema failures over a configured windowed threshold could kill the job
• Retry and failure functionality
• All interactions with services/systems use patterns such as circuit breakers and
back-pressure
• Baked in libraries such as Micrometer and OpenTracing to enforce integration with the
operations stack
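A sketch of the consumer-side defensive checks described above: verify the schema-version header is within the range this service supports, validate the payload, and route validation failures to a DLQ. The SchemaValidator interface stands in for whichever JSON Schema library is used; header and topic names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class DefensiveConsumerSketch {

    static final int MIN_SCHEMA_VERSION = 3;   // fixed range accepted by this consumer
    static final int MAX_SCHEMA_VERSION = 5;

    interface SchemaValidator {
        boolean isValid(String payload, int schemaVersion);
    }

    /** Returns true if the record should be processed, false if it was ignored or dead-lettered. */
    static boolean accept(ConsumerRecord<String, String> record,
                          SchemaValidator validator,
                          KafkaProducer<String, String> dlqProducer) {
        Header versionHeader = record.headers().lastHeader("schema-version");
        if (versionHeader == null) {
            return false;   // not standards compliant: ignore incorrectly headered messages
        }
        int version = Integer.parseInt(new String(versionHeader.value(), StandardCharsets.UTF_8));
        if (version < MIN_SCHEMA_VERSION || version > MAX_SCHEMA_VERSION) {
            return false;   // outside the configured schema/library range: ignore
        }
        if (!validator.isValid(record.value(), version)) {
            // Claims a supported version but fails schema validation: send to the DLQ.
            dlqProducer.send(new ProducerRecord<>("events-dlq", record.key(), record.value()));
            return false;
        }
        return true;
    }
}
```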
11. Retry & failure
• Developers could choose from multiple
retry and failure options
• Could combine strategies:
• Retry 5 times, then pause 5 times, then fail
• Pause and park had exponential time strategies:
• A short pause at first, then give a service progressively more time to recover
• Retry:
• Retry message immediately
• Pause:
• Pauses the job entirely
• Used when ordering is important
• Park:
• Puts the message on hold but keeps processing the topic
• Could use multiple topics or RocksDB
• Used when ordering isn’t important
• Fail:
• Sends message to DLQ immediately
• Schema validation errors and the like
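A hedged sketch of how the retry, pause, park and fail options could compose, e.g. “retry 5 times, then pause 5 times (with exponential backoff), then fail”. The Outcome names and thresholds are illustrative, not the actual library from the talk.

```java
public class RetryStrategySketch {

    enum Outcome { RETRY, PAUSE, PARK, FAIL }

    static final int MAX_RETRIES = 5;
    static final int MAX_PAUSES = 5;

    // Decide what to do with a message after its n-th failed attempt.
    static Outcome decide(int failedAttempts, boolean orderingMatters) {
        if (failedAttempts <= MAX_RETRIES) {
            return Outcome.RETRY;                              // retry the message immediately
        }
        if (failedAttempts <= MAX_RETRIES + MAX_PAUSES) {
            return orderingMatters ? Outcome.PAUSE             // pause the whole job to keep ordering
                                   : Outcome.PARK;             // park the message, keep processing the topic
        }
        return Outcome.FAIL;                                   // give up: send the message to the DLQ
    }

    // Exponential pause: short pause now, then give the downstream service more time to recover.
    static long pauseMillis(int pauseNumber) {
        return (long) (500 * Math.pow(2, pauseNumber - 1));    // 500ms, 1s, 2s, 4s, 8s...
    }
}
```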
12. DLQs
• If all else fails, ask a human to intervene
• May need to kill the job, if message ordering is important
• Need to provide tooling for admins to investigate and replay messages from the DLQ
• Have to think about replays:
• Track attempts so that a message doesn’t keep looping through
• What happens if you replay the topic: do you want to replay dead messages?
• Typically only saw messages on the DLQ during pre-prod testing:
• Retry strategies dealt with most service issues
• Schema enforcement in producer removed chance of corrupt messages
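A sketch of the replay tooling and attempt tracking mentioned above: each replay increments a header so an unprocessable message doesn’t keep looping between the DLQ and the source topic. The header and topic names and the max-replay threshold are assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class DlqReplaySketch {

    static final int MAX_REPLAYS = 3;

    // Admin tooling calls this to push a dead message back onto its source topic.
    static void replay(ConsumerRecord<String, String> deadMessage,
                       String sourceTopic,
                       KafkaProducer<String, String> producer) {
        int attempts = replayAttempts(deadMessage);
        if (attempts >= MAX_REPLAYS) {
            // Don't keep looping: leave it on the DLQ for a human to investigate.
            throw new IllegalStateException("Message already replayed " + attempts + " times");
        }
        ProducerRecord<String, String> replayed =
                new ProducerRecord<>(sourceTopic, deadMessage.key(), deadMessage.value());
        replayed.headers().add("replay-attempts",
                String.valueOf(attempts + 1).getBytes(StandardCharsets.UTF_8));
        producer.send(replayed);
    }

    static int replayAttempts(ConsumerRecord<String, String> record) {
        Header header = record.headers().lastHeader("replay-attempts");
        return header == null ? 0 : Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8));
    }
}
```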
13. Outcomes
• Reduced data risks surrounding deployments:
• Only correct data would be served via services
• Platform downtime was reduced:
• Allowing upgrade scripts to be run during live hours reduced deployment pressures
• Provided reassurance around NFRs during upgrades:
• Platform was sized to ensure NFRs weren’t breached during deployments
• The customer bought into the trade-off of increased latency for increased correctness
• Provided strong contracts between teams:
• Defensive measures allowed developers to focus on business logic