Streaming analytics with Python and Kafka

© MOSAIC SMART DATA 1
Egor Kraev, Head of AI, Mosaic Smart Data
PyData, April 4, 2017
Streaming analytics with asynchronous Python and Kafka
Overview
▪ This talk will show what streaming graphs are, why you
really want to use them, and what the pitfalls are
▪ It then presents a simple, lightweight, yet reasonably robust
way of structuring your Python code as a streaming graph,
with the help of asyncio and Kafka
A simple streaming system
▪ The processing nodes are often stateful, need to process the messages
in the correct sequence, and update their internal state after each
message (an exponential average calculator is a basic example)
▪ The graphs often contain cycles, for example A -> B -> C -> A
▪ The graphs nearly always have some nodes consuming and emitting
multiple streams
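As an illustration of a stateful node, here is a minimal sketch of the exponential average calculator mentioned above. The class name, the `process` method and the `alpha` parameter are hypothetical choices for this example, not part of any framework from the talk.

```python
class ExponentialAverage:
    """Stateful node: keeps a running exponentially weighted average."""

    def __init__(self, alpha):
        self.alpha = alpha   # smoothing factor in (0, 1]; hypothetical parameter name
        self.value = None    # internal state, updated after every message

    def process(self, x):
        # The result depends on all earlier messages, so order matters.
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value


node = ExponentialAverage(alpha=0.5)
outputs = [node.process(x) for x in [10.0, 20.0, 30.0]]
print(outputs)  # [10.0, 15.0, 22.5]
```

Feeding the same three messages in a different order gives a different final state, which is why such nodes need their input in the correct sequence.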
Why structure your system as a streaming graph?
▪ Makes the code clearer
▪ Makes the code more granular and testable
▪ Allows for organic scaling
▪ Start out with the whole graph in one file, then gradually split it up until each node
is a microservice with multiple workers
▪ As the system grows, nodes can run in different languages/frameworks
▪ Makes it easier to run the same code on historic and live data
▪ Treating your historical run as replay also solves some realtime problems such
as aligning different streams correctly
Two key features of a streaming graph framework
1. Language for graph definition
▪ Ideally, the same person who writes the business logic in the processing nodes should
define the graph structure as well
▪ This means the graph definition language must be simple and natural
2. Once the graph is defined, scheduling is an entirely separate, hard
problem
▪ If we have multiple nodes in a complex graph, with branchings, cycles, etc, what order
do we call them in?
▪ Different consumers of the same message stream, consuming at different rates - what to
do?
▪ If one node has multiple inputs, what order does it receive and process them in?
▪ What if an upstream node produces more data than a downstream node can process?
Popular kinds of scheduling logic
1. Agents
▪ Each node autonomously decides what messages to send
▪ Each node accepts messages sent to it
▪ Logic for buffering and message scheduling needs to be defined in each node
▪ For example, pykka
2. 'Push' approach
▪ First attempt at event-driven systems tends to be ‘push’
▪ Examples are 'reactive' systems, e.g. Microsoft’s RxPY
▪ When an external event appears, it’s fed to the entry point node.
▪ Each node processes what it receives and, once done, triggers its downstream nodes
▪ Benefit: simpler logic in the nodes; each node only needs a list of its
downstream nodes to send messages to
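The push approach can be sketched in a few lines of plain Python. This is a toy illustration only: `PushNode`, `connect` and `receive` are hypothetical names, and it is not how RxPY or any other reactive library is implemented.

```python
class PushNode:
    """Push-style node: processes a message, then triggers its downstream nodes."""

    def __init__(self, func):
        self.func = func
        self.downstream = []  # the only wiring a push node needs to know about

    def connect(self, node):
        self.downstream.append(node)
        return node  # returned so that connections can be chained

    def receive(self, message):
        result = self.func(message)
        for node in self.downstream:
            node.receive(result)


received = []
entry = PushNode(lambda x: x * 2)
entry.connect(PushNode(lambda x: x + 1)).connect(PushNode(received.append))

# External events are fed to the entry point node.
for event in [1, 2, 3]:
    entry.receive(event)
print(received)  # [3, 5, 7]
```

Note that the caller of `receive` has no way to slow down: every event is pushed through the whole chain as soon as it arrives.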
Problems with the Push approach
1. What if the downstream can't cope?
▪ Solution: 'backpressure': downstream nodes are allowed to signal upstream when
they're not coping
▪ That limits the amount of buffering we need to do internally, but can bring its own
problems.
▪ Backpressure needs to be implemented well at the framework level, or else we end up with a
callback nightmare: each node must have callbacks to both upstream and downstream,
and manage these as well as an internal message buffer (RxPY as an example)
▪ Backpressure combined with multiple downstreams can lead to processing locking up
accidentally
2. Push does really badly at aligning merging streams
▪ Even if individual streams are ordered, different streams are often out of sync
▪ If the graph branches and then re-converges, how do we make sure the 'right'
messages from both branches are processed together?
The Pull approach
▪ Let's turn the problem on its head!
▪ Let's say each node doesn't need to know its downstream, only its
parents.
▪ The execution is controlled by the most downstream node. When it's ready, it
requests more messages from its parents
▪ No buffering needed
▪ When streams merge, the merging node is in control and decides which
stream to consume from first
Limitations:
▪ The sources must be able to wait until queried
▪ Has problems with two downstream nodes wanting to consume the
same message stream
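Python generators are a natural way to sketch the pull approach, because a generator only produces a value when its consumer asks for one. The example below assumes messages are tuples whose first element is a timestamp; the function names and the sample data are made up for illustration.

```python
def source(messages):
    # A source that can wait: it yields only when the consumer asks for more.
    for message in messages:
        yield message


def merge_by_time(left, right):
    """Merging node: decides which parent to pull from, here by earliest timestamp."""
    l = next(left, None)
    r = next(right, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] <= r[0]):
            yield l
            l = next(left, None)
        else:
            yield r
            r = next(right, None)


prices = source([(1, "price", 100), (4, "price", 101)])
trades = source([(2, "trade", 5), (3, "trade", 7)])

# The final consumer drives execution: nothing runs until it asks for messages.
merged = list(merge_by_time(prices, trades))
print([m[0] for m in merged])  # [1, 2, 3, 4]
```

The merging node holds at most one pending message per parent, so the two streams come out aligned without any larger internal buffer.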
The challenge
I set out to find or create an architecture with the following properties:
▪ Allows realtime processing
▪ All user-defined logic is in Python with fairly simple syntax
▪ Both processing nodes and graph structure
▪ Lightweight approach, thin layer on top of core Python
▪ Can run on a laptop
▪ Scheduling happens transparently to the user
▪ No need to buffer data inside the Python process (unless you want to)
▪ Must scale gracefully to larger data volumes
What is out there?
▪ In the JVM world, there's no shortage of great streaming systems
▪ Akka Streams: a mature library
▪ Kafka Streams: allows you to treat Kafka logs as database tables, do joins etc
▪ Flink: Stream processing framework that is good at stateful nodes
▪ On the Python side, a couple of frameworks are almost what I want
▪ Google Dataflow only supports streaming Python when running in Google Cloud;
the local runner only supports finite datasets
▪ Spark has awesome Python support, but its basic approach is map-reduce on
steroids, which doesn't fit that well with stateful nodes and cyclical graphs
Collaborative multitasking, event loop, and asyncio
▪ The event loop pattern, collaborative multitasking
▪ An ‘event loop’ keeps track of multiple functions that want to be executed
▪ Each function can signal to it whether it’s ready to execute or waiting for input
▪ The event loop runs the next function until it has nothing left to process; the function
then surrenders control back to the event loop
▪ A great way of running multiple bits of logic ‘simultaneously’ without
worrying about threading – runs well on a single thread
▪ asyncio is the event loop implementation in Python’s standard library
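A minimal sketch of the pattern with asyncio: two coroutines share one thread, and each hands control back to the event loop at every `await`. The `worker` function and the shared `log` list are hypothetical names used only for this example.

```python
import asyncio


async def worker(name, n, log):
    for i in range(n):
        log.append((name, i))
        # 'await' is the point where this coroutine surrenders control
        # back to the event loop, so that other coroutines can run.
        await asyncio.sleep(0)


async def main():
    log = []
    # Both workers run 'simultaneously' on a single thread.
    await asyncio.gather(worker("a", 2, log), worker("b", 2, log))
    return log


log = asyncio.run(main())
print(log)
```

The printed log shows the two workers taking turns rather than one running to completion before the other starts.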
Kafka
▪ A simple yet powerful messaging system
▪ A producer client can create a topic in Kafka and write messages to it
▪ Multiple consumer clients can then read these messages in sequence, each
at their own pace
▪ Partitioning of topics: if there are multiple consumers in the same group, each sees a
distinct subset of the partitions of the topic
▪ It's a proper Big Data application with many other nice properties; the only
one that concerns us is that it's designed to deal with lots of data and lots of
clients, fast!
▪ Can spawn an instance locally in seconds using Docker, e.g. using the image
at https://hub.docker.com/r/flozano/kafka/
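The property that matters here, several consumers reading the same messages in sequence at their own pace, can be pictured with a toy in-memory model. This is not the Kafka API: `ToyTopic`, `produce` and `poll` are hypothetical names, and the model ignores partitions, persistence and networking.

```python
class ToyTopic:
    """Append-only log; each consumer keeps its own read offset."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer name -> index of the next unread message

    def produce(self, message):
        self.log.append(message)

    def poll(self, consumer, max_messages=1):
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch


topic = ToyTopic()
for i in range(4):
    topic.produce(i)

fast = topic.poll("fast", max_messages=4)  # reads everything at once
slow = topic.poll("slow", max_messages=1)  # reads one message at a time
slow_next = topic.poll("slow", max_messages=1)
print(fast, slow, slow_next)  # [0, 1, 2, 3] [0] [1]
```

Because reading does not remove messages from the log, a slow consumer never holds back a fast one, and all buffering lives in the log itself.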
Now let’s put it all together!
▪ Structure your graph as a collection of pull-only subgraphs that consume
from and publish to multiple Kafka topics
▪ Inside each subgraph, can merge streams; can also choose to send each
message of a stream to one of many downstream nodes
▪ Inside each subgraph, each message goes to at most one downstream!
▪ If two consumers want to consume the same stream, push that stream to
Kafka and let them each read from Kafka at their own pace
▪ If you have a 'hot' source that won't wait: run a separate process that
just pushes the output of that source into a Kafka topic, then consume at
leisure
Our example streaming graph sliced up according to the
pattern
▪ The ‘active’ nodes are green – exactly one per subgraph
▪ All buffering happens in Kafka, it was built to handle it!
Scaling
▪ Thanks to asyncio, can run multiple subgraphs in the same Python
process and thread, so can in principle have a whole graph in one file (two
if you want one dedicated to user input)
▪ Scale using Kafka partitioning to begin with: for slow subgraphs, spawn
multiple nodes each looking at its own partitions of a topic
▪ If that doesn't help, replace the problematic subgraphs by applications in
other languages/frameworks
▪ So stateful Python nodes and Spark subgraphs can coexist happily,
communicating via Kafka
Example application
▪ To give a nice syntax to users, we implement a thin façade over the
AsyncIterator interface, adding overloading of operators | and >
▪ So a data source is just an async iterator with some operator
overloading on top:
▪ The | operator applies an operator (such as ‘map’) to a source, returning
a new source
▪ The a > b operator creates a coroutine that, when run, will iterate over a
and feed the results to b; a can be an iterable or an async iterable
▪ The ‘run’ command asks the event loop to run all its arguments
▪ The Kafka interface classes are a bit of syntactic sugar on top of aiokafka
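The slides do not show the implementation, so the following is only a guess at what such a façade might look like. `Source`, `map_op`, `ListSink` and this `run` are hypothetical names, and an in-memory list sink stands in for the Kafka interface classes so that the sketch runs without a broker.

```python
import asyncio


async def _as_async(iterable):
    # Accept both plain iterables and async iterables.
    if hasattr(iterable, "__aiter__"):
        async for item in iterable:
            yield item
    else:
        for item in iterable:
            yield item


class Source:
    """Thin wrapper over an async iterator, adding the | and > operators."""

    def __init__(self, iterable):
        self._it = _as_async(iterable)

    def __aiter__(self):
        return self._it

    def __or__(self, op):
        # source | op applies the operator and returns a new source
        return Source(op(self))

    def __gt__(self, sink):
        # source > sink returns a coroutine; nothing runs until it is awaited
        return sink.consume(self)


def map_op(func):
    # Hypothetical 'map' operator: turns a source into a transformed stream.
    def apply(source):
        async def gen():
            async for item in source:
                yield func(item)
        return gen()
    return apply


class ListSink:
    """Stand-in for a Kafka producer: collects messages in a list."""

    def __init__(self):
        self.items = []

    async def consume(self, iterable):
        async for item in _as_async(iterable):
            self.items.append(item)

    def __lt__(self, iterable):
        # Reflected form of '>': lets a plain list appear on the left-hand side.
        return self.consume(iterable)


def run(*coroutines):
    async def main():
        await asyncio.gather(*coroutines)
    asyncio.run(main())


squares, raw = ListSink(), ListSink()
run(
    Source([1, 2, 3]) | map_op(lambda x: x * x) > squares,
    [4, 5] > raw,
)
print(squares.items, raw.items)  # [1, 4, 9] [4, 5]
```

The syntax works without parentheses because | binds more tightly than > in Python. For a plain list on the left of >, Python falls back to the reflected `__lt__` of the right-hand operand, which is why the sink defines it.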
Summary
▪ Pull-driven subgraphs
▪ Asyncio and async iterators to run many subgraphs at once
▪ Kafka to glue it all together (and to the world)
Questions? Comments?
▪ Please feel free to contact me at egor@dagon.ai