How to implement a solution using Kafka as a distributed database and Kafka Streams as the glue between different services, and how to apply some Domain-Driven Design concepts to ensure data integrity and design the boundaries of each service.
A Functional Approach to Architecture - Kafka & Kafka Streams - Kevin Mas Ruiz & Alexey Gravanov (joint session in Munich & Barcelona only)
KAFKA & KAFKA STREAMS
KEVIN MAS RUIZ & ALEXEY GRAVANOV
Kevin Mas Ruiz
WHO ARE WE?
WHAT TO EXPECT?
● To meet ScoutWorks :)
● Tales about business requirements
● A brief introduction to some Kafka & Kafka Streams conventions
● See how we designed our architecture
● Talk about resilience in a functional architecture
● Platform for selling cars & motorbikes
● 8 countries + 10 language versions
● 55,000+ dealers
● 2.4+ million listings
● 3+ billion page impressions per month
● 10+ million active users per month
● The core of the domain is listings
● Images are one of the main sources of information in a listing
● Dealers want to export those listings to other marketplaces
A system able to export dealers’ high-quality listings
to other marketplaces to improve their visibility on the market.
● A dealer is capable of enabling and disabling the export process
● All active listings of a dealer will be exported
● Exported listings that become inactive or are deleted should be hidden
on external marketplaces
MORE BUSINESS REQUIREMENTS
● It’s acceptable not to have the latest listing information exported in real time,
but it should eventually be updated
● It’s important to have all listings on external marketplaces ASAP to ensure
● The listings data format is dynamic, so it should be possible to reprocess a
listing and export it again
● Load fluctuates during the day, so scaling up / down is mandatory
● Easy to add additional marketplaces
● Easy to monitor / trace any listing
WHAT IS KAFKA?
● Distributed streaming platform
● Records are published to topics, which are formed by partitions
● Each partition is an append-only (*) structured commit log
● Records consist of a partition key, a value and a timestamp, plus an assigned
offset, which is the position of the record in the log
● Sharding of records based on partition key
● Replication of records depending on configuration
● Ordering of records within partition
● At-least-once delivery guarantee of records
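Sharding by partition key can be sketched without a broker at all. This is a simplified stand-in, not Kafka's actual code: the real default partitioner applies murmur2 to the serialized key, but the modulo idea is the same, and it is what gives per-key ordering.

```java
public class PartitionSketch {
    // Simplified stand-in for Kafka's default partitioner: hash the
    // record key and take it modulo the partition count. (The real
    // partitioner uses murmur2 on the serialized key, but the idea
    // is the same: same key -> same partition -> per-key ordering.)
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        int first = partitionFor("listing-42", partitions);
        int second = partitionFor("listing-42", partitions);
        if (first != second)
            throw new AssertionError("same key must map to same partition");
        if (first < 0 || first >= partitions)
            throw new AssertionError("partition out of range");
        System.out.println("ok");
    }
}
```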
Kafka is often used for building real-time streaming applications
that transform or react to the streams of data.
● Listings change propagation fits very well to Kafka streaming mindset
● Possibility to go back in time and reprocess records if needed
● Enables developers to think of a design as a composition of small functions
● Opinionated library to process streams of records
● Provides possibility to build elastic, scalable and fault-tolerant solutions
● Uses Kafka to store current offsets / intermediate state of processed data
● Supports stateless processing, stateful processing and windowing
operations, e.g. aggregates of records
● For stateless operations, allows microservices to be seen as state-ignorant
pure functions, letting Kafka Streams take care of side-effects
● Functions run once and completely
● Functions can be chained, generating more abstract functions
● State is shared as a parameter, avoiding mutable state between functions
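A plain-Java sketch of those three properties, with no Kafka involved (the Listing and ExportedListing types are hypothetical stand-ins): each step is a pure function, steps are chained into a more abstract one, and all state travels as parameters.

```java
import java.util.function.Function;

public class Composition {
    record Listing(String id, boolean active) {}
    record ExportedListing(String id, String marketplace) {}

    public static void main(String[] args) {
        // Each step is a pure function: no shared mutable state,
        // everything the step needs arrives as its argument.
        Function<Listing, Listing> validate =
            l -> { if (l.id().isEmpty()) throw new IllegalArgumentException(); return l; };
        Function<Listing, ExportedListing> export =
            l -> new ExportedListing(l.id(), "some-marketplace");

        // Chaining small functions yields a more abstract one.
        Function<Listing, ExportedListing> pipeline = validate.andThen(export);

        ExportedListing out = pipeline.apply(new Listing("listing-42", true));
        System.out.println(out.id());
    }
}
```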
● Can only be ensured on a single partition
● Is degraded when repartitioning
● Is the boundary of consistency
● Is a set of records in a single topic with the same partition key
● Represents a single business object (for example, a Listing)
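Since an aggregate is just the ordered set of records sharing one partition key, its current state can be recovered by folding those records in offset order. A minimal sketch with a hypothetical ListingEvent type, again with no Kafka involved:

```java
import java.util.List;

public class AggregateFold {
    // Hypothetical event type: each record carries the new active flag.
    record ListingEvent(String listingId, boolean active) {}
    record ListingState(String listingId, boolean active) {}

    // Fold the records of one partition key, in offset order,
    // into the current state of the business object.
    static ListingState replay(List<ListingEvent> events) {
        ListingState state = null;
        for (ListingEvent e : events) {
            state = new ListingState(e.listingId(), e.active());
        }
        return state;
    }

    public static void main(String[] args) {
        List<ListingEvent> log = List.of(
            new ListingEvent("listing-42", true),
            new ListingEvent("listing-42", false),
            new ListingEvent("listing-42", true));
        // The last record wins: the listing is currently active.
        System.out.println(replay(log).active());
    }
}
```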
"Everything fails all the time."
Werner Vogels, VP & CTO at Amazon.com
For every topic with replication factor of N,
Kafka tolerates failures up to N-1 nodes.
● One node setup: after coming back, picking up where processing stopped
● Multi-node setup: other nodes taking over, but…
○ Stateless processor: continue working as soon as nodes are re-balanced
○ Stateful processor, simple setup: can take a while until state is built up
○ Stateful processor, hot stand-by setup: local state is being built up, but records are
not actually processed until failover happens
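The hot stand-by setup above corresponds to Kafka Streams' standby replicas. A minimal sketch of the relevant configuration (the application id and broker addresses are placeholder values):

```properties
# Kafka Streams application settings (placeholder values)
application.id=listing-export
bootstrap.servers=broker-1:9092,broker-2:9092

# Keep one warm copy of each state store on another instance,
# so failover does not have to rebuild state from the changelog.
num.standby.replicas=1
```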
● Function signature should be unique (only one function should be
responsible for a single transformation)
● Functions, by design, should not pertain to a single domain, but
map between two domains
● The consistency boundary is a partition (or a single aggregate root)
● A system can be seen as a composition of functions, but data needs
to be managed by an external system.
● As with any pure function, we should test transformations, not side-effects.
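In practice that means asserting on input/output pairs of the transformation itself (Kafka Streams also ships a TopologyTestDriver for driving a whole topology without a broker). A plain-Java sketch, with a hypothetical hide-inactive-listings transformation:

```java
import java.util.Optional;
import java.util.function.Function;

public class TransformationTest {
    record Listing(String id, boolean active) {}

    // The transformation under test: inactive listings map to empty,
    // which downstream would turn into a "hide" command.
    static final Function<Listing, Optional<Listing>> exportable =
        l -> l.active() ? Optional.of(l) : Optional.empty();

    public static void main(String[] args) {
        // Test the pure transformation; no Kafka, no side-effects.
        if (exportable.apply(new Listing("a", true)).isEmpty())
            throw new AssertionError("active listing should be exported");
        if (exportable.apply(new Listing("b", false)).isPresent())
            throw new AssertionError("inactive listing should be hidden");
        System.out.println("ok");
    }
}
```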
● Adding a correlation id on data sources is really useful for tracing, but
boundaries should be chosen carefully.
● Kafka Streams should not be used for external I/O. For example, if
you need a service that makes HTTP requests, use another streaming
engine for that (we used Akka Streams).
● Kafka Streams’ learning curve is really steep.
● Kafka Streams and Kafka by default are not there yet for medium-sized
messages (like ~50KB). You will need to tweak and optimize the
configuration.
● Backpressure is a natural fit as functions are pull-based.
● Single-direction data-flow is a mindset that needs to be learned and
practiced.
For questions or suggestions:
Kevin Mas Ruiz (@skmruiz)
Alexey Gravanov (@gravanov)