Streaming analytics with asynchronous Python and Kafka
Egor Kraev, Head of AI, Mosaic Smart Data
PyData, April 4, 2017
Overview
▪ This talk will show what streaming graphs are, why you
really want to use them, and what the pitfalls are
▪ It then presents a simple, lightweight, yet reasonably robust
way of structuring your Python code as a streaming graph,
with the help of asyncio and Kafka
A simple streaming system
▪ The processing nodes are often stateful: they need to process messages in the
correct sequence and update their internal state after each message (an
exponential average calculator is a basic example; see the sketch below)
▪ The graphs often contain cycles, for example A -> B -> C -> A
▪ The graphs nearly always have some nodes that consume and emit multiple
streams
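
For concreteness, here is a minimal sketch of such a stateful node: a plain-Python exponential average calculator. The class and method names are illustrative, not taken from any particular framework.

    class ExpAverageNode:
        """A stateful node: exponential moving average of a numeric stream."""

        def __init__(self, alpha=0.1):
            self.alpha = alpha
            self.state = None  # internal state, updated after each message

        def process(self, value):
            # messages must arrive in the correct sequence for the state to make sense
            if self.state is None:
                self.state = value
            else:
                self.state = self.alpha * value + (1 - self.alpha) * self.state
            return self.state


    node = ExpAverageNode(alpha=0.5)
    print([node.process(x) for x in [1.0, 2.0, 3.0]])  # [1.0, 1.5, 2.25]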
Why structure your system as a streaming graph?
▪ Makes the code clearer
▪ Makes the code more granular and testable
▪ Allows for organic scaling
▪ Start out with the whole graph in one file, then gradually split it up until each node
is a microservice with multiple workers
▪ As the system grows, nodes can run in different languages/frameworks
▪ Makes it easier to run the same code on historic and live data
▪ Treating your historical run as a replay also solves some realtime problems, such
as aligning different streams correctly
Two key features of a streaming graph framework
1. Language for graph definition
▪ Ideally, the same person who writes the business logic in the processing nodes should
define the graph structure as well
▪ This means the graph definition language must be simple and natural
2. Once the graph is defined, scheduling is an entirely separate, hard
problem
▪ If we have multiple nodes in a complex graph, with branches, cycles, etc., in what
order do we call them?
▪ What do we do about different consumers of the same message stream, consuming
at different rates?
▪ If one node has multiple inputs, what order does it receive and process them in?
▪ What if an upstream node produces more data than a downstream node can process?
Popular kinds of scheduling logic
1. Agents
▪ Each node autonomously decides what messages to send
▪ Each node accepts messages sent to it
▪ Logic for buffering and message scheduling needs to be defined in each node
▪ For example, pykka
2. 'Push' approach
▪ A first attempt at an event-driven system tends to be 'push'
▪ For example, 'reactive' systems, e.g. Microsoft's RxPY
▪ When an external event appears, it's fed to the entry point node
▪ Each node processes what it receives and, once done, triggers its downstream nodes
▪ Benefit: simpler logic in the nodes; each node only needs a list of its
downstream nodes to send messages to (see the sketch below)
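
A minimal sketch of the push pattern in plain Python; the PushNode class and its send method are hypothetical, for illustration only.

    class PushNode:
        """A push-style node: it knows only its list of downstream nodes."""

        def __init__(self, fn):
            self.fn = fn
            self.downstream = []

        def send(self, msg):
            out = self.fn(msg)
            for child in self.downstream:
                child.send(out)  # once done processing, trigger downstream nodes


    # wire up a tiny graph: double -> printer, then push an event in
    double = PushNode(lambda x: 2 * x)
    printer = PushNode(print)
    double.downstream.append(printer)
    double.send(21)  # prints 42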
Problems with the Push approach
1. What if the downstream can't cope?
▪ Solution: 'backpressure': downstream nodes are allowed to signal upstream when
they're not coping
▪ That limits the amount of buffering we need to do internally, but can bring its own
problems.
▪ Backpressure needs to be implemented well at the framework level, else we end up with a
callback nightmare: each node must have callbacks to both upstream and downstream,
and manage these as well as an internal message buffer (RxPY is an example)
▪ Backpressure combined with multiple downstreams can lead to processing accidentally
locking up
2. Push does really badly at aligning merging streams
▪ Even if individual streams are ordered, different streams are often out of sync
▪ What if the graph branches and then re-converges, how do we make sure the 'right'
messages from both branches are processed together?
The Pull approach
▪ Let's turn the problem on its head!
▪ Let's say each node doesn't need to know its downstream nodes, only its
parents
▪ Execution is controlled by the most downstream node. When it's ready, it
requests more messages from its parents (see the sketch below)
▪ No buffering needed
▪ When streams merge, the merging node is in control and decides which
stream to consume from first
Limitations:
▪ The sources must be able to wait until queried
▪ It has problems when two downstream nodes want to consume the same
message stream
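
A minimal sketch of the pull pattern using plain Python generators; the function names are illustrative. Note how nothing runs until the most downstream consumer asks for values.

    def numbers():
        """A source that waits until queried: values are produced on demand."""
        n = 0
        while True:
            yield n
            n += 1


    def merge_alternating(a, b):
        """A merging node: it decides which parent stream to consume from first."""
        while True:
            yield next(a)
            yield next(b)


    # the most downstream node drives execution: nothing happens until we ask
    stream = merge_alternating(numbers(), numbers())
    print([next(stream) for _ in range(6)])  # [0, 0, 1, 1, 2, 2]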
The challenge
I set out to find or create an architecture with the following properties:
▪ Allows realtime processing
▪ All user-defined logic is in Python with fairly simple syntax
▪ Both processing nodes and graph structure
▪ Lightweight approach, thin layer on top of core Python
▪ Can run on a laptop
▪ Scheduling happens transparently to the user
▪ No need to buffer data inside the Python process (unless you want to)
▪ Must scale gracefully to larger data volumes
What is out there?
▪ In the JVM world, there's no shortage of great streaming systems
▪ Akka Streams: a mature library
▪ Kafka Streams: allows you to treat Kafka logs as database tables, do joins etc
▪ Flink: Stream processing framework that is good at stateful nodes
▪ On the Python side, a couple of frameworks are almost what I want
▪ Google Dataflow only supports streaming Python when running in Google Cloud;
the local runner only supports finite datasets
▪ Spark has awesome Python support, but its basic approach is map-reduce on
steroids, which doesn't fit well with stateful nodes and cyclical graphs
Collaborative multitasking, event loop, and asyncio
▪ The event loop pattern, collaborative multitasking
▪ An ‘event loop’ keeps track of multiple functions that want to be executed
▪ Each function can signal to it whether it’s ready to execute or waiting for input
▪ The event loop runs the next ready function; when that function has nothing left
to process, it surrenders control back to the event loop
▪ A great way of running multiple bits of logic ‘simultaneously’ without
worrying about threading – runs well on a single thread
▪ Asyncio is Python’s official event loop implementation
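
A minimal example of collaborative multitasking with asyncio: two coroutines interleave 'simultaneously' on a single thread, each surrendering control whenever it awaits.

    import asyncio


    async def ticker(name, delay):
        for i in range(3):
            await asyncio.sleep(delay)  # surrender control back to the event loop
            print(name, i)


    # the event loop interleaves both coroutines on one thread
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(ticker('fast', 0.1),
                                           ticker('slow', 0.25)))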
Kafka
▪ A simple yet powerful messaging system
▪ A producer client can create a topic in Kafka and write messages to it
▪ Multiple consumer clients can then read these messages in sequence, each
at their own pace
▪ Partitioning of topics: if multiple consumers are in the same group, each sees a
distinct subset of the topic's partitions
▪ It's a proper Big Data application with many other nice properties; the only
one that concerns us here is that it's designed to handle lots of data and lots of
clients, fast!
▪ Can spawn an instance locally in seconds using Docker, e.g. using the image
at https://hub.docker.com/r/flozano/kafka/
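
A minimal aiokafka round trip, assuming a broker is listening on localhost:9092 and that the topic name 'my_topic' is ours to choose; the exact constructor arguments vary between aiokafka versions.

    import asyncio
    from aiokafka import AIOKafkaProducer, AIOKafkaConsumer


    async def produce():
        producer = AIOKafkaProducer(bootstrap_servers='localhost:9092')
        await producer.start()
        try:
            await producer.send_and_wait('my_topic', b'hello')
        finally:
            await producer.stop()


    async def consume():
        consumer = AIOKafkaConsumer('my_topic',
                                    bootstrap_servers='localhost:9092',
                                    auto_offset_reset='earliest')
        await consumer.start()
        try:
            async for msg in consumer:  # each consumer reads at its own pace
                print(msg.value)
                break
        finally:
            await consumer.stop()


    loop = asyncio.get_event_loop()
    loop.run_until_complete(produce())
    loop.run_until_complete(consume())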
Now let’s put it all together!
▪ Structure your graph as a collection of pull-only subgraphs that consume
from and publish to multiple Kafka topics
▪ Inside each subgraph, we can merge streams; we can also route each
message of a stream to one of many destinations
▪ Inside each subgraph, each message goes to at most one downstream!
▪ If two consumers want to consume the same stream, push that stream to
Kafka and let them each read from Kafka at their own pace
▪ If you have a 'hot' source that won't wait: run a separate process that
pushes the output of that source into a Kafka topic, then consume at
leisure
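
Putting the pieces together, a sketch of one pull-only subgraph: it pulls from one Kafka topic at its own pace, keeps internal state, and publishes to another topic. The topic names and the smoothing constant are made up for illustration.

    import asyncio
    from aiokafka import AIOKafkaConsumer, AIOKafkaProducer


    async def avg_subgraph():
        """One pull-only subgraph: Kafka in -> stateful node -> Kafka out."""
        consumer = AIOKafkaConsumer('prices', bootstrap_servers='localhost:9092')
        producer = AIOKafkaProducer(bootstrap_servers='localhost:9092')
        await consumer.start()
        await producer.start()
        state = None
        try:
            async for msg in consumer:  # pull messages at our own pace
                x = float(msg.value)
                state = x if state is None else 0.9 * state + 0.1 * x
                await producer.send_and_wait('avg_prices', str(state).encode())
        finally:
            await consumer.stop()
            await producer.stop()


    asyncio.get_event_loop().run_until_complete(avg_subgraph())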
Our example streaming graph, sliced up according to the pattern
▪ The ‘active’ nodes are green – exactly one per subgraph
▪ All buffering happens in Kafka; it was built to handle it!
Scaling
▪ Thanks to asyncio, we can run multiple subgraphs in the same Python
process and thread, so we can in principle have the whole graph in one file (two
if you want one dedicated to user input)
▪ Scale using Kafka partitioning to begin with: for slow subgraphs, spawn
multiple workers, each looking at its own partitions of a topic (see the sketch below)
▪ If that doesn't help, replace the problematic subgraphs with applications in
other languages/frameworks
▪ So stateful Python nodes and Spark subgraphs can coexist happily,
communicating via Kafka
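
A sketch of that first scaling step, reusing the hypothetical 'prices' topic from above: start several copies of this process with the same group_id and Kafka splits the topic's partitions among them, with no change to the node logic.

    import asyncio
    from aiokafka import AIOKafkaConsumer


    async def worker():
        # run several copies of this process with the same group_id:
        # Kafka assigns each copy a distinct subset of the topic's partitions
        consumer = AIOKafkaConsumer('prices',
                                    bootstrap_servers='localhost:9092',
                                    group_id='avg_workers')
        await consumer.start()
        try:
            async for msg in consumer:
                ...  # same subgraph logic as before, now scaled horizontally
        finally:
            await consumer.stop()


    asyncio.get_event_loop().run_until_complete(worker())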
Example application
▪ To give users a nice syntax, we implement a thin façade over the
AsyncIterator interface, overloading the | and > operators
▪ So a data source is just an async iterator with some operator
overloading on top:
▪ The | operator applies an operator (such as ‘map’) to a source, returning
a new source
▪ The a > b operator creates a coroutine that, when run, will iterate over a
and feed the results to b; a can be an iterable or an async iterable
▪ The ‘run’ command asks the event loop to run all its arguments
▪ The Kafka interface classes are a bit of syntactic sugar on top of aiokafka
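
A minimal sketch of how such a façade might look; Source, amap, arange and run are hypothetical names, and this simplified version handles only async iterables in >.

    import asyncio


    class Source:
        """Thin façade over an async iterator, overloading | and >."""

        def __init__(self, ait):
            self.ait = ait

        def __aiter__(self):
            return self.ait.__aiter__()

        def __or__(self, op):
            # src | op applies the operator and returns a new source
            return Source(op(self))

        def __gt__(self, sink):
            # src > sink builds a coroutine that pumps items into the sink
            async def pump():
                async for item in self:
                    sink(item)
            return pump()


    def amap(fn):
        """An operator: maps fn over a source, yielding a new stream."""
        def op(source):
            async def gen():
                async for item in source:
                    yield fn(item)
            return gen()
        return op


    async def arange(n):
        for i in range(n):
            yield i


    def run(*coroutines):
        """Ask the event loop to run all its arguments."""
        asyncio.get_event_loop().run_until_complete(asyncio.gather(*coroutines))


    run(Source(arange(5)) | amap(lambda x: x * x) > print)  # prints 0 1 4 9 16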
Summary
▪ Pull-driven subgraphs
▪ Asyncio and async iterators to run many subgraphs at once
▪ Kafka to glue it all together (and to the world)
Questions? Comments?
▪ Please feel free to contact me at egor@dagon.ai
